I. INTRODUCTION


The English language is the most widely used language of
science today. It is written in an alphabet descended from
the Latin alphabet and is in wide use in scientific fields.
The English alphabet is familiar to people in Europe and in
all countries whose languages use alphabets descended from
Latin. There are only slight differences among the various
alphabets that have descended from Latin.

The wide use of Latin alphabets has made it easy to set
standards for typewriters and console keyboards. The
similarity in grammar common to most of these languages,
together with their fonts and direction of flow (i.e., left
to right), has made standardization easy.

Keep in mind that many of the computer pioneers made an
effort not to limit the implementation of their software to
one spoken language. Software is the key factor limiting the
use of computers in any language. Typically, programmers'
lack of knowledge of a foreign language limits their ability
to write applications acceptable to the user. Not many
nations are blessed with computer development technology.
However, all nations have people who, as users, are capable
of contributing to humanity using this technology.


Given the technology existing today, if we can create an
interface between a host foreign language and a target
application language, there will be fewer barriers for
nations that do not use standard English-, French-, or
German-based computer operating systems and software. The
interface will accept user commands from the host
environment and translate them to the syntax of the target
environment. It is assumed that the user is knowledgeable
in the semantics of the target environment in the terms of
his spoken language.

The question may be asked, "What good will this approach
do such a nation?" There are several good points. The two
most important are these: one, a good library of software
already exists; and two, the price of existing software
(even with the addition of an interface communicator) is
less than that of newly written customized software. It is
faster and easier to write an interface than to rewrite a
large body of software.

Two user environments should not be confused. In the first,
the customized foreign alphabets used in many countries on
mainframes for specific applications are developed by
contractors who are expert in that application but not
necessarily in the foreign language. The mainframes must use
the software provided by the original contractors. It takes
a lot of effort and capital to develop a new software
application for the special machine. This limits the use of
the computer to operators and data entry personnel, with
minimal creative programming from the user side. Users do
not share in the expertise of others or in continuously
improving software, because there are few users and minimal
feedback to software developers.

The second user environment is that of the average user who
has some scientific background but has neither access to
mainframe hardware nor the capital to invest in it. This
user is often an educator, a student, or a professional.
This category of users has great potential. Software with a
native language interface would be both very helpful and
affordable for this group. This group is very capable of
contributing in their respective fields with the powerful
processing features available in personal computer
technology today.

This thesis is concerned with the second user environment
for several reasons. The second group of users are the
creative ones. Their understanding of computers and their
applications is a major step toward building the target
machine with compatible native standards. This will
eliminate the ad hoc design by the contractor, who most of
the time has to hire a non-technical translator and dictate
to them the language specification, key words, and commands
of the operating system or query language. Usually a
translator will translate the key words of the machine's
native language to the target language using its alphabet.
The translator may have minimal programming or computer
experience. This will most likely lead to an ambiguous
environment for users to work with.

The feasibility of such an approach is constrained by
several factors. The language, or the user environment, is
one factor: how is the language implemented or emulated on
standard Latin language hardware? The compatibility of the
target machine (i.e., micro to mini computers) with others
in the same family is also a factor. These factors affect
feasibility. Economic feasibility is based on supply and
demand, and a developer must weigh the benefit against the
development cost before developing such interface software.

The Arabic language is very rich in vocabulary and
historical background. The Arabic alphabet is very old. The
language was used for several centuries by leading ancient
mathematicians, physicians, biologists, and chemists. They
contributed successfully to their fields using the Arabic
alphabet. Their numerals, symbols, and equations were all
written in Arabic. However, this does not make it simple to
use the Arabic alphabet in the modern computer environment.

One reason is that the direction of flow in reading and
writing is from right to left. Secondly, Arabic characters
are not printed like Latin characters. Arabic words are
printed like calligraphy. Arabic characters must be written
in either stand-alone or connected form. A character may
be located in one of three positions: at the beginning of a
word, in the middle, or at the end of a word. Under a set of
complicated rules, the shape of a character is determined by
its location within the word. This difficulty has
complicated attempts to provide a software emulation of the
Arabic environment on personal computers.

The goal of this thesis is to provide an approach to
solving this problem. The steps that must be followed will
be described, along with special considerations. To show
that translation is possible, we will develop an interface
to communicate between an Arabic form of source code in the
PASCAL language and an existing English PASCAL compiler.
The interface will take sample source code written in Arabic
and lexically translate it to English source code. The goal
is that, given correct Arabic source code, the interface
will produce correct English source code. This translation
needs to be done only once. Once the program is compiled,
the interface step is no longer needed.
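
The lexical translation step can be pictured with a short
sketch. It is written in Python purely for illustration and
is not the interface developed in this thesis. The Arabic
spellings in the table and the names KEYWORDS, TOKEN, and
translate are assumptions made for the example.

```python
import re

# Hypothetical Arabic spellings for three PASCAL key words
# (assumed for illustration only).
KEYWORDS = {
    "برنامج": "program",
    "بداية": "begin",
    "نهاية": "end",
}

# A token is a quoted PASCAL string, a run of word characters,
# or any single other character (space, newline, punctuation).
TOKEN = re.compile(r"'[^']*'|\w+|\W")


def translate(source: str) -> str:
    """Replace Arabic key words with English ones, token by token."""
    return "".join(KEYWORDS.get(tok, tok) for tok in TOKEN.findall(source))
```

Identifiers, numbers, and quoted strings pass through
unchanged in this sketch. A complete interface would also
have to deal with the character code set and the direction
of flow, which are discussed in the following chapters.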
II. BACKGROUND ON ARABIC CHARACTER


A. INTRODUCTION 

There are 28 basic characters in the Arabic alphabet
(Figure 1). However, these basic characters are not
sufficient for use with computers or typewriters.
Authorities agree [Ref. 1] that the optimum set should use a
minimum of 31 characters (Figure 2), three more characters
than the original set. The additional three characters are
needed to constitute the optimum set for representing Arabic
texts. One may check the Kufic script, which is over 1500
years old, to realize that engravings by ancient Arabs were
done with close to 31 characters. Each character had one
shape. Over the years, variations of the characters have
developed for ease of writing and reading. Each character
may have from two to five shapes depending on its location
within a word. All applications must use these variations
as standards to represent Arabic texts. Implementing the
variations is critical for compatibility. The code
representation of any variation must follow a strict
standard to ensure survival among other implementations.

The Arabic alphabet has only three vowels among the 28
characters (see Figure 3 for the alphabet names).
Vowelization is also performed through the use of diacritics
(see Figure 4). Most Arabic texts do not show diacritics.
Readers have learned to read and understand a word based
on the context of its use. If misinterpretation is
possible, verifications are provided in parentheses. Most
applications today do not require diacritic symbols.

Both Arabic and Hindu numerals are used in the Arabic
world. North African countries use the Arabic numerals (as
used with Latin alphabets). The name "Arabic" is given to
the numerals used with Latin alphabets, and Hindu numerals
are used by most of the Arabic world (Figure 5). However,
history books show that both systems originated in India.
Arabic uses the Latin comma as the decimal point, to
distinguish it from the Arabic zero, which has the shape of
the Latin decimal point ".".


B. ARABIC LANGUAGE 
The Arabic language differs from languages descended
from Latin in several ways. The primary differences are:

* Arabic is written right to left instead of left to
right.

* Vowels are represented by diacritics, in the form of
marks over or under most letters within a word.

Secondary differences are:

* Letters in Arabic may or may not be joined, according to
their location within the word. A particular letter may be
joined to the preceding letter, the following letter, or
both.

* Each letter has between two and five different forms,
depending on its contextual position.


Lexically, the Arabic language can be defined in BNF
notation as follows [Ref. 1: p. 28]:

<language>  ::= {<sentence>};
<sentence>  ::= {<word>};
<word>      ::= {<character><voc.sym><character>};
<character> ::= { see Figure 1. };
<voc.sym>   ::= { see Figure 4. };


C. WRITING ARABIC 

Writing in Arabic flows from right to left. Additional
lines also run from right to left, beginning below the
previous line. A word is entered by typing the first
character at the cursor position, followed (to the left) by
the next character. Take the word "hello" as an example. If
the same word is entered in the Arabic direction, it will be
entered as follows:

cursor position ----------------------------- _<
step 1. enter character "h" ----------------- _h<
step 2. enter character "e" ----------------- _eh<
step 3. enter character "l" ----------------- _leh<
step 4. enter character "l" ----------------- _lleh<
step 5. enter character "o" ----------------- _olleh<


This demonstrates the direction of flow. However, if one
must also worry about the shape of each character, entry
may seem tedious for long text. In some applications one
must provide diacritics as well. In short, typing one
vocalized word seems like a puzzle.
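
The right-to-left entry of "hello" shown above can be
reproduced with a few lines of Python. This is only a sketch
of the direction of flow; the function name
type_right_to_left is an assumption, and character shapes and
diacritics are ignored.

```python
def type_right_to_left(keys: str) -> list:
    """Return the contents of the line after each keystroke.

    Each new character is placed to the left of what is already
    on the line, as in the "hello" example.
    """
    line = ""
    snapshots = []
    for key in keys:
        line = key + line  # the newest character goes on the left
        snapshots.append(line)
    return snapshots


for step, text in enumerate(type_right_to_left("hello"), start=1):
    print(f"step {step}: _{text}<")
```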

There are rules governing the shape (form) of a letter
based on its contextual position. Dewachi, Abdulilah
[Ref. 1: p. 27] has the following opinion on the rules:

These rules have, in my opinion, been developed for ease
of handwriting and have no bearing on the semantic and/or
syntactic requirement of the language.

Whatever the cause or reason for the development of the
rules, all books, newspapers, and magazines in the Arab
countries today are written using those rules. They will
also stay that way for years to come.

Arabic letters are cursive in shape. The implementation
of the alphabet is highly dependent on how legitimate the
characters look. The cursive nature of the characters
requires that both the monitor and the graphics adapter
provide good resolution. High resolution is also required
for supporting correct vocalization, as previously discussed.


D. ARABIC NUMERALS 

Both the eastern Arabic numerals and the western Arabic
numerals (Figure 5) are used. Countries like Algeria,
Morocco, and Tunisia use the western Arabic numerals. The
choice of numeral system is not a critical issue, since a
number has the same value in both representations.
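
Because the two numeral systems differ only in the shape of
the digits, converting between them is a one-to-one
substitution. The Python sketch below assumes the eastern
digits are taken from the Unicode range starting at U+0660,
which is not the code set used in this thesis; the function
names are also assumptions.

```python
# Assumption: eastern digits come from Unicode U+0660..U+0669.
WESTERN = "0123456789"
EASTERN = "".join(chr(0x0660 + d) for d in range(10))

TO_EASTERN = str.maketrans(WESTERN, EASTERN)
TO_WESTERN = str.maketrans(EASTERN, WESTERN)


def to_eastern(text: str) -> str:
    """Replace western digits with their eastern shapes."""
    return text.translate(TO_EASTERN)


def to_western(text: str) -> str:
    """Replace eastern digits with their western shapes."""
    return text.translate(TO_WESTERN)
```

The value of a number is untouched by the conversion; only
the glyph chosen for each digit changes.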

Many people believe that the Arabs write numbers from
left to right. This is a misconception. The language
books and schools teach the classical way of writing and
reading the numerals. The classical way is to use either
the words ("one", "two", ...) or the numerals (1, 2, ...),
writing from right to left. For example, the number 523
will be written in Arabic as "three and twenty and five
hundred." It may sound wrong in English composition, but
this is the syntax that classical books use. This method
should be encouraged. It is also followed in reading the
numbers.


The most common method in handwriting numbers is to
write the digits in the order they are said. An example of
how numbers are read and written today is the year 1986,
pronounced as "one thousand nine hundred six and eighty."
Notice that the six comes before the eighty. Writing the
number "1986" in numerals is done as follows:

first digit     1___
second digit    19__
third digit     19_6
fourth digit    1986


This method is far too complicated to be adopted by
mechanical machines. The classical method should be
encouraged for another obvious reason: the numbers are
entered least significant digits first, in low memory. From
the computer hardware point of view, the adders/subtractors
may begin work on a number before the complete number is
loaded [Ref. 1]. This is the more efficient way. Also, both
numbers and strings will be right justified.
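
The advantage of entering the least significant digit first
can be seen in a digit-serial adder, sketched here in Python
under the assumption of decimal digits. The name serial_add
is hypothetical. Each output digit is produced as soon as the
matching input digits arrive, without waiting for the whole
number.

```python
from itertools import zip_longest


def serial_add(a_digits, b_digits):
    """Add two numbers given as digits, least significant first.

    Yields the digits of the sum, also least significant first,
    one per input position, carrying into the next position.
    """
    carry = 0
    for a, b in zip_longest(a_digits, b_digits, fillvalue=0):
        carry, digit = divmod(a + b + carry, 10)
        yield digit
    if carry:
        yield carry


# 523 + 1986, both entered in the classical order
print(list(serial_add([3, 2, 5], [6, 8, 9, 1])))  # [9, 0, 5, 2], i.e. 2509
```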

This chapter has outlined the major concerns and
differences between the Arabic and Latin alphabets. There
are a few more things worth noticing. The opening brackets
"[", "{", and "(" are the closing brackets in Arabic, and
vice versa. The Arabic question mark has the same look as
"?" but rotated 180 degrees about its vertical axis. A
complete code set, including special characters, is listed
in the ARCII code set (Appendix D). ARCII will be discussed
in detail in later chapters.
III. CONTEXTUAL PROBLEMS IN ARABIC WORDING


For any computer to work in Arabic, it must also be able
to handle the English alphabet. Arabic users will pay a few
extra dollars to add the bilingual features when purchasing
a computer. The form of the bilingual feature is a
controversial issue. This chapter will show why one should
be concerned about using mixed mode, or even alternating
between the two alphabets, Latin and Arabic.

There are three major differences between alphabets
descended from Arabic and those descended from Latin: the
direction of flow, diacritics, and the position-dependent
shape of characters. These issues are specific to the
language. This chapter will discuss them with respect to
the computer environment.

Each difference requires special attention when the Arabic
alphabet is implemented in hardware. The direction of flow
in reading and writing is very complicated for users and
developers alike. This is especially true where the
keyboard, the display, and the printer are to operate in
bilingual mode. Arabic is read and written in the opposite
direction to Latin. The difficulty arises when the user
wants to switch to the other mode for another application,
or wishes to mix both character sets within the same
application.


A boom in the introduction of electronic computing in
the Arabic world led manufacturers to take shortcuts in
meeting the complicated needs of the Arabic alphabet. Also,
the Arabic alphabet is used in several countries with
non-Arabic languages. This wide use invited companies to
quickly develop a character set for Arabic based on limited
research. As a result, important language needs such as
diacritics were neglected. This has also led to a delay in
the realization of an effective solution.

The contextual problem, that is, the variant shape of
characters, is the most difficult. Establishing a solution
means deciding on the style or method that developers should
follow in implementing Arabic character sets. The problem
is the complexity of providing the user with all possible
shapes for the 28-character set on the keyboard. Each
character has between two and four shapes; at an average of
three shapes per character, a total of 84 codes is required
to represent the minimum set of the Arabic alphabet. This
number is about 60 percent higher than the 52 codes that the
English alphabet (upper and lower case) requires. The
special characters and diacritics require still more codes.
In some cases the application of a diacritic to a character
requires a unique shape to represent it, and therefore a
unique code for the combination of character and diacritic.
The use of "Hamzah"¹ also requires special attention when
used with any of the three vowels in the alphabet. The
limited number of codes the keyboard has is the limiting
factor in planning the code assignments. Some efforts and
proposals will be discussed in the following chapter.

¹ The "hamzah" is one of the three characters that were
added to the alphabet in addition to the original character
set.


A. DIRECTION OF FLOW

Working in mixed mode is considered a must in the Arabic
environment. There are two approaches to handling the
mixed-mode data entry and storage problem. One approach
calls for the data to be stored in aural order (i.e.,
logical order). The second approach is to store the data in
the same order as it looks (i.e., visual order). Keep in
mind that if an Arabic word is inserted in English text, the
last character of the word will be encountered first when
scanning from left to right.

The first approach places the burden on the display, which
must put the incoming data in the correct direction for
display. The display must interpret an escape code or a
mode bit sent with the data. The easiest method is to use
the high bit of each character code (if it is not otherwise
used) to indicate whether the character is Arabic or Latin.
This option calls for smart display devices.
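
The mode-bit idea can be sketched in Python as follows. The
sketch assumes a 7-bit character code whose eighth bit is
free; the names ARABIC_FLAG, tag, and split_runs are
hypothetical and do not describe any particular product.

```python
from itertools import groupby

ARABIC_FLAG = 0x80  # assumed free high bit of an 8-bit byte


def tag(code: int, is_arabic: bool) -> int:
    """Attach the language mode to a 7-bit character code."""
    if not 0 <= code <= 0x7F:
        raise ValueError("code must fit in 7 bits")
    return (code | ARABIC_FLAG) if is_arabic else code


def split_runs(data):
    """Group a byte stream into (is_arabic, [7-bit codes]) runs.

    A smart display could use these runs to decide the direction
    in which each stretch of characters is drawn.
    """
    runs = []
    for is_arabic, group in groupby(data, key=lambda b: bool(b & ARABIC_FLAG)):
        runs.append((is_arabic, [b & 0x7F for b in group]))
    return runs
```

The data stays in logical order in storage; only the display
needs to know that flagged runs flow from right to left.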

The second approach is to store the data in visual order.
This approach places the burden on the computer to determine
how to store the data so that no shifting of display
direction is needed. This means the display program will
keep track of the language mode and reverse the order as
needed to store the data in the appropriate order. In
handwriting, mixed modes are handled in the following
fashion:

- continue writing until reaching a foreign character.

- count the number of spaces occupied by foreign
characters up to the first native character.

- skip that number of spaces and write back toward where you
stopped before skipping. When done, the writer should
end where he/she jumped from.

- skip the same number of spaces you counted. This is
where the next native character belongs.


It seems that humans can do this routine more easily than
computers. The computer can only deal with incoming data as
it arrives, one character at a time. This means the
computer does not know in advance how many foreign
characters are coming. The computer can use a logical device
called a stack. Characters of the other mode are stored
(pushed on the stack) up to the next native character. At
this point the computer has the foreign string in reverse
order on the stack. In the next step the computer writes
from the top of the stack until no more characters are
on the stack. Then the program continues with the last
encountered native character. In this approach the
direction of flow for the display is maintained. Obviously
this method has several disadvantages. One, it slows the
storing of data in mixed mode. Two, it keeps the computer
from doing other functions, whereas a smart display could
handle the display of mixed-mode data stored in logical
order.
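
The stack routine described above can be sketched in Python.
The function name to_visual_order and the is_foreign
predicate are assumptions for the example, and upper-case
letters stand in for the foreign (opposite-direction)
characters.

```python
def to_visual_order(chars, is_foreign):
    """Store mixed-mode text so it displays in one direction.

    Foreign characters are pushed on a stack as they arrive.
    When the next native character arrives, the stack is popped,
    which writes the foreign run in reverse order.
    """
    out = []
    stack = []
    for ch in chars:
        if is_foreign(ch):
            stack.append(ch)      # hold until the run ends
        else:
            while stack:          # flush the run, last in first out
                out.append(stack.pop())
            out.append(ch)
    while stack:                  # run that reaches the end of input
        out.append(stack.pop())
    return "".join(out)


print(to_visual_order("abc DEF ghi", str.isupper))  # abc FED ghi
```

In this sketch a space counts as a native character, so a
foreign run of several words is reversed one word at a time.
A fuller treatment would have to decide to which mode
neutral characters such as spaces belong.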

The approach that should be taken is connected with
resolving the contextual issue, that is, the variant
character shape problem.


B. ARE DIACRITICS REQUIRED? 

By linguistic standards, the omission of diacritics by
computers murders the Arabic language. Linguists have
always officially criticized the mispronunciation of
statements by television and radio people. The use of
diacritics is a must in the language, even by the
recommendation of westerners involved with the Arabic
alphabet [Ref. 1: pp. 39-46].

Diacritics were discussed in a previous chapter. There
are five basic diacritics. The five are (Figure 3), from
right to left: "Fat_ha", "Dammah", "Kasrah", "Sukoon", and
"Shaddah". The first three can be doubled, in the same
manner as double quotes in Latin. When any of these
diacritics is doubled it is known as "Tanween" and adds an N
sound to the character. The Shaddah has the same effect as
doubling the consonant in English. It can be used in
conjunction with any of the first three or their "Tanween."
The Sukoon, when used, means that the character must be read
in its primitive form, without any of the previous
diacritics.

An example of one word using different diacritics will
show how the sound, and subsequently the meaning, changes.
The word pronounced "tilmeeth" in Arabic means a student.
The "th" at the end is the character "Thal" in Arabic. The
example will show the different sounds of the word when only
the last character carries different diacritics.

WORD        VOWELIZATION   PRONOUNCED
TILMEETH    "FAT_HA"       TILMEETHA
TILMEETH    "KASRAH"       TILMEETHI
TILMEETH    "DAMMAH"       TILMEETHO
TILMEETH    "SUKOON"       TILMEETH

Using the "Tanween" effect with the first three diacritics,
the same word is pronounced as follows:

- with "Fat_ha tanween"    TILMEETHAN
- with "Kasrah tanween"    TILMEETHIN
- with "Dammah tanween"    TILMEETHON

Shaddah can be used with all of the above except the Sukoon.
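
The table above amounts to a small lookup from diacritic to
the sound added at the end of the word. The Python sketch
below encodes only that table, using the transliterations
given in the text; the names ENDINGS and vocalize are
assumptions, and the Shaddah is left out.

```python
# Sound added by each diacritic on the last character,
# in the transliteration used in the table above.
ENDINGS = {"fat_ha": "A", "kasrah": "I", "dammah": "O", "sukoon": ""}


def vocalize(word: str, diacritic: str, tanween: bool = False) -> str:
    """Return the transliterated word with its final vowel sound."""
    if tanween and diacritic == "sukoon":
        raise ValueError("only the first three diacritics can be doubled")
    return word + ENDINGS[diacritic] + ("N" if tanween else "")


print(vocalize("TILMEETH", "kasrah"))                # TILMEETHI
print(vocalize("TILMEETH", "dammah", tanween=True))  # TILMEETHON
```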

The use of diacritics removes ambiguity in the reading of
text. It is powerful enough to change the meaning of a
sentence completely. The vowelization of verbs by
diacritics can change a sentence to the passive voice. In
Arabic the verb comes before the noun. So in Arabic the two
statements, 'was stolen Ali a book' and 'stole Ali a book,'
could not be distinguished without the use of diacritics,
especially on the verb. The effect of the "er" and "ee" in
English, as in "employer/employee," is also achieved by the
use of diacritics on the noun in Arabic. In combination,
the failure to use diacritics can completely obscure the
meaning of a sentence. For example, it would be as if, in
the sequence "fired/was fired employee/employer," we did
not know which of each alternative is meant. The employee
either was fired or fired someone. On the other hand, the
employer either was fired or fired someone. See Figure 6
for some examples with and without vowels.

Clearly, one can see the need for diacritics. In
religious and history texts, they are used extensively. At
an international symposium on the standardization of
character code sets and keyboards for the Arabic language in
computers, held on 1-4 June 1980, several proposals were
presented by researchers and by companies that had already
developed their own character sets [Refs. 1,2]. All the
proposals and recommendations agreed on including the
diacritics. The use of diacritics will be beneficial in
data bases, artificial intelligence, and educational
textbooks.


C. THE CONTEXTUAL ISSUES 

The mere presence of a character in different locations 
within a word determines the shape to be written or read. 
Should the computer do the analysis and free the user from 
worrying about a large complex character set, or should the 
keyboard contain all possible variations of each character 
and have the user learn to master more than one hundred 
strokes for the alphabet in addition to numerals, special
characters, and punctuation?

One popular approach is to provide only a minimum set of
required characters, usually between 31 and 60, not
including diacritics, numerals, and special characters.
This approach is known as the single character single shape
keyboard. The data is stored in memory or on storage devices
using this reduced code. The reduced code is analyzed by an
interface to give the right form or shape. The interface is
either part of the display, when smart displays are used, or
a shell on top of the operating system (O.S.) that
contextually analyzes the character form.
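
A much simplified sketch of such contextual analysis is given
below in Python. It relies on one fact of the script: six
letters (alef, dal, thal, ra, zain, and waw) never join to
the letter that follows them. Everything else is an
assumption made for the example: the letters are identified
by Unicode code points rather than by any vendor's reduced
code, the name contextual_forms is hypothetical, and
diacritics, hamzah, and ligatures are ignored.

```python
# Letters that do not join to the following letter:
# alef, dal, thal, ra, zain, waw (Unicode code points assumed).
NON_FORWARD_JOINING = set("\u0627\u062F\u0630\u0631\u0632\u0648")


def contextual_forms(word: str) -> list:
    """Choose a form name for each letter of an unvocalized word."""
    forms = []
    last = len(word) - 1
    for i, ch in enumerate(word):
        joins_prev = i > 0 and word[i - 1] not in NON_FORWARD_JOINING
        joins_next = i < last and ch not in NON_FORWARD_JOINING
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


# kaf, ta, alef, ba (the word "kitab", a book)
print(contextual_forms("\u0643\u062A\u0627\u0628"))
```

For this word the sketch selects the initial, medial, final,
and isolated forms in turn: the alef joins to the ta before
it but not to the ba after it, so the ba stands alone.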

The issue is not yet settled or standardized among all
users of the Arabic alphabet, nor among Arabic countries.
To my knowledge, no meeting of authorized people from all
concerned countries has yet succeeded in agreeing on a
standard. A few companies that stepped into the market
early have generated their own versions of character code
sets. Some companies have realized the gap between their
early implementation and actual language needs. The gap
became more apparent when the product was not utilized in
all the areas and aspects for which it was designed. Some
companies have realized that the survival and popularity of
their product depend on compatibility, at least in the
codes of a character's internal representation. Some
companies went further by investing in research for an
optimum solution. Language experts were hired and/or
consulted by companies like IBM, TI, and WANG. The
companies are following efforts toward solutions and
continuing the research to achieve an effective solution.

In resolving the multiple character shapes, most
companies have tried to reduce all possible codes for a
character to a single code, using several philosophies.
Texas Instruments has presented [Ref. 1] three approaches
to reducing the Arabic code.

The first approach was called "CORRESPONDENCE &
DIFFERENCES." This approach divided the alphabet into
groups. The first type, A, contains characters with one,
two, or three points (Appendix B). The second type, B,
contains characters without points. The last type, C,
contains characters having at least one form of each type,
for example the characters "RA" and "ZA." These two
characters have the same form, with a point on the "ZA" and
no point on the "RA." The idea is that the basic form has
one key (code); where two or more characters have the same
basic form, the points can be added later.

The second approach was called "ROOTS & APPENDICES"
(Appendix B). This approach also divided the alphabet into
groups. Two groups have six characters each. Another
group has four characters. The characters in each of these
groups share the same cursive "APPENDICES." The "ROOT" of
a character can be used at the beginning or in the middle of
a word. One appendix complements each root of a group.
This requires a total of seven codes for a group of six
roots. Otherwise the group would require (for six
characters, each with three contextual forms) a total of 18
separate keys and/or codes. This approach implicitly asks
for more software to analyze the appendices. A character
may be represented by two codes internally, which makes
text storage inefficient.

The last approach was "CONTEXTUAL ANALYSIS" (Appendix
B). Texas Instruments has developed a product using this
approach. The DS990 Bilingual System can handle
Arabic/Latin modes and display them on a screen or line
printer. The contextual analysis approach, in all the
developments seen by the author, uses a reduced code set.
The reduced code set is used for the internal
representation of data. Keyboard keys of the Arabic set are
kept to a minimum, usually the basic form. A software
interface analyzes each character contextually and displays
it in the right form. In some applications this interface
software is moved away from the CPU and made the
responsibility of the display terminals. Such terminals are
called 'SMART' terminals. TI's DS990 system diagram
(Appendix C) shows the configuration of a typical system.

TI realized the need for diacritics in the Arabic
language after it introduced the system to the marketplace.
At an international symposium held in Riyadh, Saudi Arabia,
on 1-4 June 1980 [Ref. 1: p. 68], in an effort at
standardization of codes, character sets, and keyboards, TI
recommended that the Arabic computer systems standards
requirement include the use of diacritics. This is an
example of the approach of the pioneer companies who had to
define and develop the alphabet code set. Premature
standards will automatically be overridden by the
authorized agency. The DS990 did not handle diacritics.
Since the use of diacritics was adopted by all standards
committees, a few companies were led to follow a new
standard that supports diacritics.

ALIS, Inc., introduced BCON 7™ as a bilingual operating
system. BCON was geared toward MS-DOS based microcomputers.
The bilingual operating system is an interface between the
operating system (O.S.) and different applications [Ref. 2].
This bilingual operating system adopted the single key or
single code approach. Each character is represented
internally in memory by a unique code. BCON also fully
supports the use of diacritics in text. The single code
approach, as mentioned before, requires that a device or an
interface (hardware or software) properly analyze the
character and display the correct form. BCON uses
Application Screen Image Compensations (ASIC) to perform the
contextual analysis. BCON uses separate codes and fonts for
each character. The internal character code gets
translated (mapped) to its output code. Each internal code
has four to five output codes. The code to be displayed is
based on the location of the character within the word.
(TI's and BCON's systems will be covered in more detail in
the next chapter.)
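
The mapping from one internal code to several output codes
can be pictured as a small table. The sketch below is in
Python and does not use BCON's actual codes, which are not
given here. As a stand-in it assumes the Unicode code point
of the letter "ba" as the internal code and the Unicode
presentation forms as the output codes; the names
OUTPUT_CODES and output_code are hypothetical.

```python
import unicodedata

BEH = "\u0628"  # internal code for the letter "ba" (Unicode assumed)

# One internal code, four output codes, chosen by position.
OUTPUT_CODES = {
    BEH: {
        "isolated": "\uFE8F",
        "final": "\uFE90",
        "initial": "\uFE91",
        "medial": "\uFE92",
    },
}


def output_code(internal: str, form: str) -> str:
    """Map an internal character code to the code of its shape."""
    return OUTPUT_CODES[internal][form]


print(unicodedata.name(output_code(BEH, "initial")))
```

Storage and searching work on the single internal code,
while only the display path deals with the four shapes.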




IV. EFFORTS TO STANDARDIZE CODES 


Several nations use the Arabic alphabet today, both
Arabic speaking and non-Arabic speaking. It is a political
challenge to gather the concerned nations and succeed in
establishing a standardized code set acceptable to all of
them. It is difficult for any one country to take the
initiative and responsibility to follow such a program until
it comes to life. It is hard for a single country to conduct
research and share knowledge with another country that is
thousands of miles away. In recent years, as cooperation
between Arab nations has increased, and as methods of
communication and travel have improved, there have been
more productive meetings and symposiums. Several countries
have cooperated to develop a possible solution to the
standard code set for Arabic in data processing.

Many countries, such as Kuwait, Iraq, Morocco, and Saudi
Arabia, have hosted meetings and symposiums, listening to
experts on the language and in the data processing field.
Researchers, as well as company representatives, have
brought up points to consider, shared their experiences, and
given recommendations. Several existing systems have been
developed or proposed by companies or individuals in the
field. The countries that have been exposed to technology
and are more developed than other Arabic nations have an
urgent need to set standards in general. Countries like
Morocco started as early as the 1950's to set standards for
printing devices.

The north African countries have progressed further in
this research. Morocco willingly shared with the Arab
nations its latest research and developments in the area.
The choice between adopting an existing system, with some
or no modification, and defining a new standard once again
is also a political issue.


A. SOLUTION EFFORTS 

Several companies have provided results of their
research and in some cases have implemented systems, giving
recommendations and, in the case of keyboard layout
proposals, the results of the tests they conducted.
Companies that have an interest in the market and have
worked in the Arabic data processing field have no authority
to develop a code set standard. Government representatives
are the authorized agency for such a task. Given the lack of
standards, several companies have proceeded to develop
Arabic code sets and implement them on hardware. This has
resulted in several incompatible code sets: data in one
system means different things in another. This approach to
the development of code sets has both disadvantages and
some advantages for the companies involved.




Early development made companies, as well as users,
understand the weaknesses of the developed systems. For
example, the omission of diacritics in TI's DS990 system
failed to fulfill the needs of the language. On the other
hand, just by introducing a product early, companies make
their name familiar to customers. The customer cannot
complain about a reasonable attempt. This did establish a
good reputation for such companies, especially when they
adopt the approved standard and reintroduce their products.
In addition to developing a good name, they gain experience
in the process. This will help them introduce a product
complying with the standards sooner. So a company's early
efforts are not a total waste.

Since early implementations ignored the use of diacritics
in text, newer designs have to pay special attention to
them. Database machines must take them into account when
sorting and searching. The representation of diacritics
will require special care from data processing machines.
The priority of characters with or without diacritics must
be known to the machine. Stripping the diacritics from a
stored string before matching it against a query will
facilitate the search. However, the target of the search,
when found, must be displayed, and stored if updated, in the
vocalized form.

Unlike Texas Instruments, IBM chose to maintain its
domination of the market for typewriters and Arabic-only EDP
machines. IBM did conduct studies of its own in an effort to
develop a code set and keyboard layout. IBM, represented by
Mr. R.P. Hajjar and Dr. A.M. Ismail, presented its attitude
toward a bilingual code set standard at the symposium held
in Riyadh, Saudi Arabia, in June 1980 [Ref. 1:p. 72]:
    Meanwhile, competent people from the Arab world and from
    elsewhere, have addressed the same subject and came up
    with a variety of solutions that are not compatible with
    each other, due to the fact that they reflect the
    requirements of a particular Arab country, but may not be
    totally acceptable by the neighboring Arab country. This
    is the main reason why IBM has not implemented such
    solutions, but will look forward to investigate the
    possibilities of their implementation, in case these
    solutions are adopted as part of an inter-Arab standard.
IBM, TI, and Wang have shared their research and expressed
their willingness to reach a solution and adopt it in their
products. This chapter will briefly cover three systems:

- TI DS990 System
- ALIS Inc., BCON System
- ASV-CODAR Proposed System.


B. TI DS990 BILINGUAL SYSTEM 

DS990 is a bilingual system that generates a 7-bit code
for ASCII characters and an 8-bit code for Arabic
characters. The system represents the Arabic alphabet with
32 unique codes in addition to 13 special characters. The 32
codes are the internal representation of the alphabet and
form the basic character set of the system (Appendix C).
TI's system uses the "one key, many shapes" philosophy. This
approach requires an interface with a smart display to
display the correct form and shape. The DS990 block diagram
(Appendix C) shows how the system is arranged. The 32 codes
are mapped to 128 less 13, giving a total of 115 shapes that
can be displayed. The display ROM interface contains all 128
shapes (Appendix C). The display service routine (DSR) and
the display ROM interface contextually analyze the basic
code set and display the data correctly by mapping one code
to one or two display code(s).


DS990 does not handle diacritics. It also increased the
optimum set from 31 to 32 unique characters, because the
system considers LAM ALEF a single character. These are two
clear violations. The use of diacritics is a must in data
processing. The LAM ALEF (DC hex in the basic character set,
Appendix C) is composed of the character LAM (D6 hex)
followed by the character ALEF (C0 hex). These are two
separate characters and should not share a unique code. The
fact that the table shows no special code for eastern Hindu
numerals indicates that the same code is used for them and
for the Arabic numerals, also known as western Hindu
(Figure 5). Depending on the display mode, the eastern Hindu
and the Arabic (western Hindu) numerals are displayed
differently. So a user in a North African country cannot use
the western Hindu (Arabic) numerals in Arabic mode. This is
not desirable.




DS990 stores information in memory in logical order in 
Latin mode and Arabic mode. The display ROM interface and 
the control program map the internal representation of one 
code to one or two display codes. For example, to display 
the character 'SEEN' as in the basic character set (CB hex 
value) (Appendix H), the character is represented by two 
display codes. The first code is the value BC hex followed 
by the code 8B hex in the display ROM interface table. 
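
The one-to-many mapping described above can be sketched as a
table lookup. This is a minimal illustration in Python, not
TI's implementation. Only the SEEN entry (CB hex mapped to BC
hex followed by 8B hex) comes from the text; the names
DISPLAY_TABLE and to_display_codes, and the rule that codes
missing from the table pass through unchanged, are
assumptions made for the example. The real display service
routine also chooses among several shapes by context, which
is not modeled here.

```python
# Internal code -> tuple of one or two display codes.
# Only the SEEN entry is taken from the text; a full table
# would come from Appendix C and is not reproduced here.
DISPLAY_TABLE = {
    0xCB: (0xBC, 0x8B),  # SEEN is drawn with two display codes
}


def to_display_codes(internal_codes):
    """Expand each internal code into its display code(s).

    Codes without a table entry are passed through unchanged
    (an assumption of this sketch).
    """
    display = []
    for code in internal_codes:
        display.extend(DISPLAY_TABLE.get(code, (code,)))
    return display


if __name__ == "__main__":
    print([hex(c) for c in to_display_codes([0xCB])])
```

Keeping the expansion in a table, rather than in the stored
data, is what lets the memory image stay in logical order
while the screen receives the shapes.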

The approach followed by TI is the typical way most
companies implement their display techniques. However, its
disadvantages are the omission of diacritics and the
treatment of "LAMALEF" as one character. TI has indicated
that it now believes the implementation must have diacritics
[Ref. 1].


C. ALIS INC., BCON SYSTEM

ALIS Inc. introduced BCON (TM) as a bilingual operating
system that could be a standard to follow, or at least close
to a standard. The bilingual operating system adopted the
"single key, single code" approach. Each character is
represented internally in memory by a unique code. BCON also
fully supports the use of diacritics in text. BCON was
geared toward MS-DOS based microcomputers. The bilingual
operating system is an interface between the MS-DOS
operating system and applications. BCON is designed to
facilitate the adaptation of the large number of existing
MS-DOS applications to Arabic [Ref. 2]. The single code
approach, as mentioned before, requires that some device or
interface (hardware or software) properly analyze the
character and display the correct form. BCON uses
Application Screen Image Compensations (ASIC) to perform the
contextual analysis and then selects the correct display
code (Appendix D).

1. Hardware and Software of BCON 

BCON hardware is another board on top of the Latin
character generator board. The new board carries the Arabic
character generator with the wiring required to allow
concurrent operation of both character generators. The two
boards are back to back and use one slot on the motherboard
of a microcomputer. Keyboard caps (or stickers) are provided
for use on the keyboard. The stickers have both alphabets
printed side by side.

The software is a program which, when activated, resides
in low memory and uses 19K bytes. Once BCON is activated, it
can be set in Latin "native" mode or Arabic mode. The only
way to free the memory is to reset the system. Both modes of
the operating system allow bilingual insertion in the
appropriate direction. In its early versions (up to early
1985), ALIS introduced a reduced code called Arabic Reduced
Code for Information Interchange (ARCII). ARCII is the
internal representation of the characters in memory and
what is seen by the operating system.

2. ARCII Code Set

Arabic Reduced Code for Information Interchange (ARCII)
is ALIS's early attempt to define a code set. The reduced
code (Appendix D) is the internal representation of data in
memory. The ALIS reduced code is completely different from
the early proposals for a target standard set put forward
by ASMO (further details will be covered in the next
section).

The code uses the graphic character positions for the
Arabic set. By setting the 8th bit to one, 128 additional
codes become available for Arabic. This allows the BCON
bilingual system to mix codes and use both ASCII and ARCII.
There are 46 different codes assigned for the alphabet,
starting with code D0 hex and ending with FD hex. ARCII
places the diacritics early in the table to give them
priority in sorting algorithms. This early positioning in
the table was not favorable, however. The reasoning will be
discussed when the standard code and the format
justification are discussed. The escape codes and special
characters should not be redefined for ARCII if similar ones
exist in Latin. This minimizes the code set for ARCII,
freeing more codes for future expansion. The number of
function codes could likewise be minimized by using the
international ones.
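
The role of the 8th bit can be illustrated with a short
Python sketch. It assumes only what the text states: codes
with the 8th bit set belong to the Arabic half, and the
alphabet occupies D0 hex through FD hex. The function names
is_arabic, is_arcii_letter, and split_runs are hypothetical,
and the individual code assignments of Appendix D are not
reproduced.

```python
ARCII_FIRST = 0xD0  # first alphabet code, as given in the text
ARCII_LAST = 0xFD   # last alphabet code, as given in the text


def is_arabic(code):
    """True when the 8th bit is set (80 hex .. FF hex)."""
    return bool(code & 0x80)


def is_arcii_letter(code):
    """True for codes in the alphabet range D0 hex .. FD hex."""
    return ARCII_FIRST <= code <= ARCII_LAST


def split_runs(data):
    """Group mixed data into (is_arabic, bytes) runs."""
    runs = []
    for byte in data:
        flag = is_arabic(byte)
        if runs and runs[-1][0] == flag:
            runs[-1][1].append(byte)   # extend the current run
        else:
            runs.append((flag, bytearray([byte])))
    return [(flag, bytes(chunk)) for flag, chunk in runs]


if __name__ == "__main__":
    # The inclusive range D0..FD holds the 46 alphabet codes.
    print(ARCII_LAST - ARCII_FIRST + 1)
    print(split_runs(b"ab\xd0\xd1c"))
```

Because one bit separates the two halves, Latin and Arabic
text can share a buffer without any mode marker between
them.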

As noted, the ALIS reduced code is completely different
from the early proposals for a target standard set. The Arab
Organization for Standardization and Metrology (ASMO), after
several years of research and after meeting with Arab
representatives, recommended the use of CODAR U-F.D. as a
standard for Arabic codes (further details will be covered
in the next section). Subsequently, ALIS and other companies
adopted the new code set in order to assure compatibility
with other applications and implementations. The reduced
code (ARCII) of Appendix D is that of BCON's original
version, where it is the internal representation of
information in memory.

The form or appearance of the characters, that is, how
they should be displayed, is not a major issue. It depends
on the machine's resolution and capabilities. The fonts and
style of displayed text vary from one machine to another.
ASMO has recommended that the style of displayed text be
left to developers. This leaves a lot of room for
manufacturers to be creative and to compete on quality for
the benefit of the user.

3. Operating Principles of BCON 

BCON, once loaded, resides in memory using 19K of low
memory. BCON has three code sets: the reduced code (ARCII),
the key code, and the display code. Figure 7 shows how the
three codes are integrated with each other. A list of the
three code sets is provided in Appendix D. ARCII includes
the diacritics as a part of the code set. This was set as a
requirement of the CODAR U-F.D. standards. BCON receives the
key code and stores it in memory in reduced code form. BCON
then contextually analyzes the reduced code and displays it
in the correct form. In the display process, BCON appends,
if necessary, what is called "TAIL GENERATION" to some
characters if they fall at the end of a word [Ref. 2].
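
Contextual analysis of this kind can be sketched in Python
as follows. This is a simplified illustration of the general
rule of Arabic script, not BCON's ASIC algorithm, whose
details are not available. Letters are represented by name;
the set NON_FORWARD_JOINING (letters that never connect to
the letter after them) and the function contextual_forms are
assumptions of the sketch, and Hamzah variants, diacritics,
and tail generation are ignored.

```python
# Letters that do not connect to the following letter.
# Names are spelled for this sketch only.
NON_FORWARD_JOINING = {"ALEF", "DAL", "THAL", "REH", "ZAIN", "WAW"}


def contextual_forms(word):
    """Return (letter, form) pairs for a list of letter names.

    The form depends on whether the letter connects to its
    neighbors: isolated, initial, medial, or final.
    """
    forms = []
    for i, letter in enumerate(word):
        # Connected to the previous letter if there is one and
        # that letter joins forward.
        joins_prev = i > 0 and word[i - 1] not in NON_FORWARD_JOINING
        # Connected to the next letter if there is one and this
        # letter itself joins forward.
        joins_next = i < len(word) - 1 and letter not in NON_FORWARD_JOINING
        if joins_prev and joins_next:
            form = "medial"
        elif joins_prev:
            form = "final"
        elif joins_next:
            form = "initial"
        else:
            form = "isolated"
        forms.append((letter, form))
    return forms


if __name__ == "__main__":
    for pair in contextual_forms(["SEEN", "LAM", "ALEF", "MEEM"]):
        print(pair)
```

A display driver would use the chosen form as an index into
its display code table, so one stored code yields several
shapes.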

The early work on BCON, as well as the work of other
companies, must be modified to correspond to the new
standards. In early 1986 ALIS introduced a new mode in
addition to ARCII. The new mode uses the ASMO approved code
set. No documents are available at this time. However, as
mentioned before, the previous effort was not totally lost.
The company still uses the contextual analysis developed
earlier, with minor modifications. The same is true for its
printer driver software. This is a good example of how early
development enables a company to react quickly to new
demands.


D. ASV CODAR-U SYSTEM 

In researching the early efforts initiated by official
organizations or government agencies for inter-Arab
unification of the code set, two names were always
associated: CODAR and Dr. Lakhdar. A few acronyms are
important here:

CODAR : Code Arabe (French)
ASV   : Arabe Standard Voyelle (French)
IERA  : Institut d'Etudes et de Recherches pour
        l'Arabisation
IBI   : Intergovernmental Bureau for Informatics
COARIN: IBI Committee on the use of Arabic in Informatics
ALESCO: Arab League Education Cultural and Science
        Organization
SASO  : Saudi Arabian Standards Organization
ASMO  : Arab Organization for Standardization and Metrology

Dr. Ahmed Lakhdar Gazal, Director of IERA (Institute for
Research and Studies for Arabization in Rabat, Morocco), has
been associated with the CODAR project for several years.
Dr. Lakhdar proposed that the Arab nations adopt the CODAR
system as a standard for telecommunications. IERA has been
at work since as far back as 1955. A standardized Arabic
code was a dream many people had expected and needed for
many years. However, IERA has no power to define it or to
make it official, even assuming it is acceptable.

The CODAR system is a long-running project geared toward
setting standards for several fields of interest. The
project covers:

PRINTING
TYPEFACES
- TRANSFER LETTERS, SELF-ADHESIVE TYPES
- SLUG-CASTING MACHINES
- MOVABLE TYPE COMPOSITIONS-CASTER
- PHOTOCOMPOSITION
TYPEWRITERS
INFORMATICS AND DATA TRANSMISSION
TELECOMMUNICATIONS

This chapter is concerned with Informatics and Data
Transmission. However, a lot of credit must be given to the
personnel behind CODAR. It took CODAR a lot of effort and
dedication by IERA's staff to accomplish a unification. A
long list of acknowledgments, appreciation, and financial
support letters from several countries and organizations
was coordinated by CODAR. The participants include:

- Moroccan Ministry of Education (1956)
- First Conference of the Arab National Commissions for
  UNESCO (1958)
- First Conference on Arabization (Rabat, 1961)
- UNESCO (Arab book-keeping experts meeting) (Cairo, 1972)

A long list of occasions and dates is given in
[Ref. 1:pp. 207-210].

Under Informatics and Data Transmission there were three
versions of the 7-bit code system. They are:

Seven bit CODAR I : first coding scheme of the ASV
                    characters
Seven bit CODAR II: a proposition for a unified Arabic
                    coding scheme, discussed at a regional
                    (IBI) meeting at Bizerte, Tunisia,
                    June 1976
Seven bit CODAR U : unified coding scheme for the Arab
                    countries proposed by COARIN (IBI
                    committee on the use of Arabic in
                    informatics) at a meeting in Rome,
                    June 1977.

The seven bit CODAR I, CODAR II, and CODAR U (Appendix
E) are code set proposals. CODAR I was produced by EURAB
and the printers were manufactured by the Italian firm SELI.
CODAR II is a subsystem of CODAR I. The subsystem can be
obtained by removing all possible combinations of the
"Harakat" (i.e., Fat'ha, Kassrah, and Dammah) with the
"Shaddah." The subsystem also leaves out three Persian
characters, the opening and closing square brackets, the
backslash, and a few variant character shapes.

CODAR U fully supports vocalization with all possible
combinations of the "Shaddah" with the "Harakat." This
system is the closest to being accepted by ASMO and approved
as a standard. ASMO's approval will give the system official
status.


E. THE STANDARDIZED SET 

In 1980 CODAR U was accepted as a working basis for a
basic code set. Recommendations and modifications were to
be presented to ASMO in order to formalize the code set.
The next step was to distribute it to ASMO's members.
Member countries are to ensure that it is implemented
accurately.

During a meeting held on 22-24 April 1982 in Rabat
(Morocco), the code for the proposed standard, called
CODAR U-F.D., was finalized and submitted to ASMO along with
six recommendations (Appendix F). The conference
recommended that ASMO distribute the code and have it tested
by IERA, SASO, and the National Center for Information in
Tunisia before enforcing it. It also recommended that ALESCO
and ASMO make every effort toward the adoption of the code
by all Arab countries.

Finally, on October 21, 1982 ASMO adopted the code
prepared by IERA and ALESCO. This code was the result of
the CODAR U-F.D. proposed in April 1982 at Rabat. The
modifications and changes are included (Appendix G). There
are a few points to consider. There are 31 codes for the
alphabet, 3 codes for "Harakat," 2 codes for "Shaddah" and
"Sukoon," 5 codes for "Hamzah," and 3 codes for "Tanween,"
totalling 44 codes. Their location in the table must not be
changed under any circumstances. The "Hamzah" in all its
variations, on top of or under characters, is considered a
form of "Hamzah." The "Hamzah" is placed at the beginning of
the code table, which in searching means that any character
with a "Hamzah" associated with it should be expected high
in the order (equivalent to "A" in Latin). This concept will
confuse users when searching or sorting, and the results of
sorting algorithms may be surprising. The table allows a
simple sort, but errors will result from the occurrence of
diacritics and of the code 60 hex (6/0) in the text. The
code 60 hex is used for connecting or extending a word for
formatting purposes. So a sorting algorithm should first
strip the text of the diacritics and the connection dash
(similar to the Latin underscore), then sort the text
according to the basic 31 character codes. The user must be
educated about all the remarks mentioned in the reasoning in
ASMO's final form of the code set. Another convention is
that the character comes first in words that are vocalized.
The form to follow is:

WORD ::= { <CHARACTER> <SHADDAH> <DIACRITICS> } *

So the "Shaddah" comes before the diacritics if both are
used for a character. The second convention is that if the
pure words match in sorting, the diacritics should then be
used by the sorting algorithm as qualifiers. In my opinion,
this violates the Regularity Principle in programming, since
the user must be concerned with and remember all the
exceptions. This does not in any way mean there is an easier
way.
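
The two sorting conventions can be sketched in Python as a
two-part sort key. This is an illustration under stated
assumptions, not the ASMO procedure itself. Unicode code
points stand in for the ASMO table positions of Appendix G,
which are not reproduced here: U+064B through U+0652 play
the role of the "Tanween," "Harakat," "Shaddah," and
"Sukoon" codes, and U+0640 plays the role of the connection
dash (60 hex). The names strip_marks, sort_key, and
sort_words are hypothetical, and Unicode order only
approximates the order of the 31 basic character codes.

```python
# Stand-ins for the ASMO codes (assumption of this sketch).
CONNECTION_DASH = "\u0640"
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def strip_marks(word):
    """Remove the diacritics and the connection dash."""
    return "".join(
        ch for ch in word
        if ch not in DIACRITICS and ch != CONNECTION_DASH
    )


def sort_key(word):
    """Primary key: the pure word. Qualifier: its diacritics."""
    marks = "".join(ch for ch in word if ch in DIACRITICS)
    return (strip_marks(word), marks)


def sort_words(words):
    """Sort by pure word first, then by diacritics as qualifiers."""
    return sorted(words, key=sort_key)


if __name__ == "__main__":
    kaf, teh, beh, alef = "\u0643", "\u062a", "\u0628", "\u0627"
    fatha = "\u064e"
    plain = kaf + teh + beh
    vocalized = kaf + fatha + teh + fatha + beh + fatha
    print(sort_words([vocalized, plain, beh + alef + beh]))
```

The stored text is never modified: the stripped form exists
only inside the key, so a word that is found or sorted is
still displayed in its vocalized form.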


F. CONCLUSION 

The ASMO code set is the standard Arabic code set that
the Arab countries must enforce. Subsequently all companies
in the area must adopt and use this standard code set. The
competition is now directed toward improving the display
application with high resolution and graphic capabilities.
Printing devices are also an area in which manufacturers can
compete, by printing different Arabic styles and fonts. The
contextual issue is left to the implementors to research and
develop for their individual products. The display form of
text on monitors and printing devices will not affect the
internal representation of the data, which must be
compatible with the standard code set. This may result in
several display sets, each reflecting a company's view of
how good Arabic text should be displayed. Hopefully this
will create a stable base to work with and encourage the
development of products based on the ASMO standards and
conventions listed in Appendix G.


V. INTERFACE DESIGN GENERAL APPROACH


The lexical translator will generate Latin code from
Arabic source code written in Pascal syntax. The Pascal
compiler can then compile and run the Latin code to generate
an output. The interface will generate correct Latin code
provided that the Arabic source code is syntactically
correct. The interface will give minimum help in correcting
the Arabic code. The user must understand the syntax and the
semantics of the language to write correct source code. The
interface is not an interactive type of translator. The
design is generally the same for all Pascal compilers. The
interface must always consider the environment it will work
in. It has two environments to consider: the bilingual
system of the source code, and the compiler environment.
From the portability and compatibility point of view, the
translator will be limited to a particular Arabic standard
and a particular PASCAL implementation.

The bilingual implementation has its own function codes.
Those codes are embedded within the Arabic source code if it
is generated under the bilingual operating system. The
bilingual operating system used here is BCON from ALIS, Inc.
There is a list of function codes in Appendix D. The PASCAL
compiler used here is TURBO PASCAL from Borland, Inc.

The Arabic implementation uses the upper half of the
8-bit character set (codes 80 hex to FF hex), normally used
for graphics, to display Arabic fonts. Some Pascal compilers
will accept any of these characters as legal characters in
string data. Turbo Pascal, for example, allows the entire
character set. This is one reason why Turbo Pascal is used
in this thesis as the target environment for the generated
code. The interface will, however, generate correct PASCAL
code even if the source code follows standard Pascal.

In what follows, "the compiler" always refers to the Turbo
Pascal compiler even though, from a theoretical point of
view, it could be any Pascal compiler. Similarly, since no
standard representation of Arabic data is yet available and
implemented, we use the BCON operating system, with ARCII
as the internal representation of data in memory.


A. MAJOR CONCEPTS 
The interface looks at any piece of code (token) as one
of several types. These types are:

- Literal string
- Comment
- Integer
- Identifier
- Functional operator.

Literal strings are constants and the interface does not
alter their code values. Comments are surrounded by '(*'
and '*)' in their Arabic equivalent codes. Integers are
important and easy to handle, since there is an isomorphic
relationship between Arabic integer tokens and Latin ones. A
real number token is made up of two integer tokens separated
by a functional operator. An identifier is any legal name in
Pascal, either a reserved word or user-defined. Functional
operators are all the codes that are used for addition,
brackets, pointer arrows, etc. In setting the specification
for programming in Arabic Pascal, the ultimate goal is to
have a one-to-one relationship between the Latin and the
Arabic special characters. We also want to avoid overloading
the use of special characters.
1. Literal Strings 

Literal strings are used for assignment to string
variables and in read and write commands. Strings are used
to interact with the user of an application and to
understand the behavior of the program. Therefore we do not
alter these strings. A literal string is any string of
characters surrounded by single or double quotes. It is the
programmer's responsibility to verify the content of an
assigned string. The literal string can contain any
character of the entire set 80 hex ... FF hex.
2. Comments 


The comment length is limited to one line. The comment
begins with an opening parenthesis followed by an asterisk
and ends with an asterisk followed by a closing parenthesis.
When the translator encounters the beginning of a comment,
it looks for the end of the comment. The comment is
considered one token. The translator will not alter the
content of the comment, since it is for the use of the
programmer only.

3. Integers

Integers are any consecutive digits from 0-9 with no
separation in between. For example, the integer printing
format "2245:6" is considered three tokens as far as the
translator is concerned. The first token is the integer
"2245," the second is the functional operator ":", and the
third is the integer "6".

Real numbers are made up of three parts as one would 
expect. They are integer token, Arabic numeric comma, and 
integer token. 
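
The decomposition of numeric text into integer and
functional operator tokens can be sketched in Python as
follows. The sketch assumes Unicode text rather than ARCII
codes, and it uses U+066B as a stand-in for the Arabic
numeric comma; both choices, and the name split_numeric, are
assumptions made for the illustration.

```python
import re


def split_numeric(text):
    """Split text into integer tokens and one-character operators.

    A run of digits is one integer token. Any other character
    is passed back on its own as a functional operator.
    """
    tokens = []
    # In Python 3, \d also matches Eastern Arabic digits.
    for match in re.finditer(r"\d+|\D", text):
        piece = match.group()
        kind = "integer" if piece[0].isdigit() else "operator"
        tokens.append((kind, piece))
    return tokens


if __name__ == "__main__":
    # The printing format from the text: three tokens.
    print(split_numeric("2245:6"))
    # A real number: integer, numeric comma, integer.
    print(split_numeric("\u0663\u066b\u0661\u0664"))
```

Treating the numeric comma as an ordinary operator means the
translator needs no separate real-number token type.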

4. Identifiers 

All legal Pascal names fall under this category. This
includes reserved words and variable names. The token is
first identified as an identifier, then looked up in the
reserved words group. If it is not in the list, then it is a
variable name. Variable names include variables, labels,
and procedure and function names. When an identifier is
encountered and it is not a reserved word, it is looked up
in the symbol table, which is organized by a hashing scheme.
If it is not already entered, it is given an identifier
number and entered, together with other information about
the token, at the beginning of the linked list for its hash
key. Since the primary user of the translated code is the
compiler, the generated program will have meaningless
variable names. However, the translator will generate a file
called "DICTIONARY" containing each identifier number and
the Arabic token associated with it.

5. Functional Operator 

Tokens are delimited by separators and terminators.
Blanks are separators, as are other codes that have a
function beyond being separators. For example, the plus and
minus signs as well as the up-arrow symbol in PASCAL are
separators. If, for example, the variable root^.left_sun
were the Arabic token, it would be translated into something
like id_1^.id_2, where the identifier numbers are
substituted for the Arabic tokens.

The scope of the variables will distinguish between
repeated variable names. If id_1 occurred in two
declarations, the compiler will distinguish between the two
occurrences of id_1, depending on the location of the
declarations. Therefore the translator does not need to
concern itself with multiple uses of the same name.
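
The identifier numbering can be sketched in Python as
follows. This is an illustration, not the thesis program: a
Python dict takes the place of the hand-built hash table
with linked lists, the class name SymbolTable and its
methods are hypothetical, and the layout of the dictionary
lines is an assumption, since the format of the "DICTIONARY"
file is not given here.

```python
class SymbolTable:
    """Maps each user-defined identifier to an id_N name."""

    def __init__(self):
        # Arabic identifier -> identifier number, in order of
        # first appearance.
        self._numbers = {}

    def translate(self, name):
        """Return the Latin name, numbering new identifiers."""
        if name not in self._numbers:
            self._numbers[name] = len(self._numbers) + 1
        return "id_" + str(self._numbers[name])

    def dictionary_lines(self):
        """One line per identifier: Latin name, then Arabic token."""
        return [
            "id_" + str(number) + " " + name
            for name, number in self._numbers.items()
        ]


if __name__ == "__main__":
    table = SymbolTable()
    root = "\u062c\u0630\u0631"       # an Arabic word for "root"
    left = "\u064a\u0633\u0627\u0631"  # an Arabic word for "left"
    print(table.translate(root) + "^." + table.translate(left))
    print(table.dictionary_lines())
```

Because a repeated name always maps to the same number,
scoping is left entirely to the compiler, as the text
describes.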


B. OPERATING PRINCIPLES 

The translator goes through several phases and each
phase has a sub-task. The process begins with the name of
the Arabic source code file. The file is opened, the target
output file is initialized, and a dictionary table file is
opened. The second phase fills a buffer with a segment of
the source code, a line at a time. The line is broken into
tokens. Each token is given a type and then translated. The
cycle is repeated for each line up to the end of the source
file.
1. File Opening and Initializing Phase 

The program starts by prompting the user to input the
source file name. The file is checked for existence and then
reset for reading. The file name is used to open two more
files, the dictionary file and the output file. The
initialization is concerned with the hash table that holds
the records of the identifier symbol table. Other parameters
control optional features, such as listing the source
comments with the output code. Another feature is debugging,
which traces the translation while the translator is
scanning and translating the source code. Both the comment
and the debugging features should be easy to set at any
point in the source code. The rest of the parameters, for
example the line number and the identifier number, are
initialized.

2. Reading and Decomposing the Source Code 

An input buffer is filled from the source code and
scanned. A line at a time is read from the buffer and
checked for special instructions (directives) for the
translator. If the line is not a directive, it is checked
to see if it is a comment. If the line is a comment or
starts with one, then the comment is either omitted or
written out, depending on the comment option. The comment
option is a Boolean variable, set by the user within the
program source code, to either omit or write out the comment
tokens in the generated file. The line, or the remainder of
the line, is then decomposed into tokens. Tokens are
identifiers, integers, blanks, or special characters.
Identifiers are either reserved words or user-defined
identifiers. Reserved words are matched with their
associated Latin reserved words. A user-defined identifier
is given a label number in the sequence of its first
appearance, if it does not already have one. Integer tokens
are scanned and each digit is mapped into its matching Latin
digit. Special characters are given their equivalent Latin
characters, as with the Arabic and Latin semicolons. Blanks
are copied, as this makes for better formatting of the
generated code.

The investigation of the token type is based on the
first character of the next token in the input buffer. For
example, if the first character is a:

- Letter: Then investigate the possibility that it is an
  identifier.
- Digit: The token must be an integer.
- Other: Then it must be a special character.

In this phase only the identifiers are translated. When a
user-defined identifier is encountered, and if it has not
previously been recognized, it is given the next identifier
number in sequence. Reserved word tokens are stored in a
constant table, in a record format. Each record has an
Arabic word and the matching Latin one. Any identifier
token is first looked up in this table. If it is found, the
index of the matched record is passed back to the main
program. Integer tokens are given the type integer and
passed back to the main program. If none of the above
applies, then we take one character and pass it back
individually.

In short, each token is given a token type and a length
and is passed back to the main program. Reserved words are
additionally passed back with the match index. User-defined
identifiers are inserted in the symbol table if they are not
found there, and their identifier number (in Latin
characters) is passed back.
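
Classification by first character can be sketched in Python
as follows. The sketch works on Unicode text rather than
ARCII codes, and the function names next_token and tokenize
are hypothetical. It covers only the three cases listed
above; directives, comments, and literal strings are assumed
to have been handled before this step.

```python
def next_token(line, pos):
    """Return (kind, text, new_pos) for the token at line[pos]."""
    first = line[pos]
    end = pos + 1
    if first.isalpha():
        # Letter: possibly an identifier. Take letters, digits,
        # and underscores.
        kind = "identifier"
        while end < len(line) and (line[end].isalnum() or line[end] == "_"):
            end += 1
    elif first.isdigit():
        # Digit: the token must be an integer.
        kind = "integer"
        while end < len(line) and line[end].isdigit():
            end += 1
    else:
        # Other: one special character, passed back individually.
        kind = "special"
    return kind, line[pos:end], end


def tokenize(line):
    """Decompose a line into (kind, text, length) tuples."""
    tokens = []
    pos = 0
    while pos < len(line):
        kind, text, pos = next_token(line, pos)
        tokens.append((kind, text, len(text)))
    return tokens


if __name__ == "__main__":
    for token in tokenize("x1 := 25;"):
        print(token)
```

Blanks come back as special tokens, so they can be copied to
the output to preserve the layout of the source line.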

3. Token Translation Phase 

The tokens are translated into Latin based on the token
type. Integer tokens are translated by mapping each Arabic
(Eastern Hindu) digit into its associated Latin (Western
Hindu) digit. Reserved word tokens are translated by writing
out their matched Latin reserved word, using the match index
found earlier. User-defined identifiers are replaced by the
identifier numbers assigned to them. The remaining special
characters are either looked up in a "CASE OF" (a PASCAL
control statement) list or held in a constant table (array).
This model uses a case statement. As each user identifier is
translated and written to the output file, it is also
written to the dictionary table along with the Arabic token
associated with it.
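
The translation step can be sketched in Python as follows.
It is an illustration under stated assumptions, not the
thesis program, which uses a Pascal case statement. Unicode
characters stand in for ARCII codes, the two entries of the
reserved word table and the two entries of the special
character table are hypothetical examples, and the names
find_reserved and translate_token are invented for the
sketch.

```python
# Eastern Hindu digits U+0660..U+0669 mapped one-to-one
# onto the Latin digits 0..9.
EASTERN_DIGITS = "".join(chr(0x0660 + d) for d in range(10))
DIGIT_MAP = str.maketrans(EASTERN_DIGITS, "0123456789")

# Hypothetical reserved word records: (Arabic word, Latin word).
RESERVED = [
    ("\u0627\u0628\u062f\u0623", "BEGIN"),
    ("\u0646\u0647\u0627\u064a\u0629", "END"),
]

# Hypothetical special character table: Arabic -> Latin.
SPECIAL = {"\u061b": ";", "\u060c": ","}


def find_reserved(word):
    """Return the index of the matching record, or None."""
    for index, (arabic, _latin) in enumerate(RESERVED):
        if arabic == word:
            return index
    return None


def translate_token(kind, text, match_index=None):
    """Translate one token according to its type."""
    if kind == "integer":
        return text.translate(DIGIT_MAP)
    if kind == "reserved":
        return RESERVED[match_index][1]
    if kind == "special":
        # Characters with no Arabic variant are copied as is.
        return SPECIAL.get(text, text)
    raise ValueError("unknown token type: " + kind)


if __name__ == "__main__":
    print(translate_token("integer", "\u0662\u0662\u0664\u0665"))
    print(translate_token("special", "\u061b"))
```

User-defined identifiers are left out here, since they are
replaced through the symbol table rather than through a
fixed table.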
4. File Closing and Ending 

The last phase closes the source file, the dictionary,
and the generated output file. This phase will be reached
only on normal program execution. The program will terminate
early if it meets a character code outside the range of the
Arabic alphabet defined by the bilingual operating system.
Overly long tokens and comments will also cause errors and
should stop the translation, since translating a comment
makes no sense.


C. DESIGN GOALS 

The interface is supposed to generate, from any Arabic
source code, Latin code in PASCAL syntax. The Arabic
programmer must master PASCAL programming in his native
language. Essentially little syntax checking and no semantic
checking will be performed on the source code. The
compiler's job is to scan the translated code and perform
the syntax and semantic checks on it. Some help with tracing
and debugging should be incorporated into such an interface.
The compiler gives its error messages in Latin. This could
be handled in several ways. One way is to keep the line
numbers of the source code and the generated code as close
as possible. The error messages are usually stored in a text
file and can be translated. This, along with the line number
of the error location, can be combined to give the location
and type of the source code error.

A second way, if the error messages cannot be translated in
their file, is to translate the error messages separately
and return them with the error number. The Arabic programmer
can look up the error number reported in Latin and the line
number of the error, then look up the translation of the
error and its explanation. In both ways a few hints
regarding the errors and possible causes should be provided
to the user.
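
Combining an error number with a translated message and a line
number, as outlined above, could look like the sketch below. It is
an assumption-laden illustration in Python: the function
describe_error, the line_offset parameter, and the message table
are hypothetical, and the error numbers are not taken from any
real compiler.

```python
def describe_error(error_no, generated_line, translated, line_offset=0):
    """Build a report from a compiler error number and line number.

    translated  -- dict mapping error numbers to translated text
                   (hypothetical; real text would be in Arabic)
    line_offset -- how many lines the generated code is shifted
                   relative to the source (0 when they line up)
    """
    source_line = generated_line - line_offset
    text = translated.get(error_no)
    if text is None:
        # No translation on file: fall back to the bare number so the
        # programmer can still look it up by hand.
        text = "untranslated compiler error %d" % error_no
    return "line %d: error %d: %s" % (source_line, error_no, text)
```

Keeping source and generated line numbers aligned makes line_offset
zero, which is why the first approach asks for them to stay as
close as possible.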


D. DESIGN LIMITATIONS 

The design does not use or handle diacritics at all as far
as reserved words are concerned. Diacritics could cause
errors and personal interpretations of how a reserved word
is written. Since most reserved words are clear once read,
the user must not type any vowels with the reserved words in
the program. Similarly, to avoid translating a single
user-defined identifier more than once, and to eliminate the
complication of debugging such cases, the user should not
use vowels in his defined identifiers. The diacritics may be
used in literal strings and headings of reports. Several
factors discourage the use of diacritics. First, some
sorting routines sort independently of diacritics, since
vowelization can upset the sorting order and the rules for
sorting the same name with different vowelization. A second
reason is that the location of the vowelization of the
character is not standardized. A third reason is that the
resolution of terminals is poor, so it is hard on the eye to
distinguish, for example, between the "FAT'HA" and the
single quote symbol in printed or displayed form.

The design therefore will not handle vowels in the 
Arabic source code. However, it should be noted that the 
option of including the diacritics requires few changes in 
the design, and a lot of attention from the Arabic pro- 
grammer. The attention is required to rewrite his own 
sorting routine that sets the ARCII value for the vowelized 
source code. Also the programmer must be consistent with 
his use of vowels with identifiers for the above reasons. 

The display and print justifications cannot be controlled
easily within the program since the bilingual operating
system does not use a standard unified code for Arabic
display and print mode. For example, in BCON, the operating
system used for the implementation of this thesis, if you
are editing in Arabic screen mode then the cursor will start
at the far right of the screen throughout the code. This
right justification is for the Arabic format and indentation
in Arabic texts. Therefore, when you exit the editor you
must set the screen mode to Latin screen mode, otherwise the
"C:>" prompt will be displayed at the far right of the
screen. So for the sake of simplicity for the user and
consistency of the generated code, the display codes are
left out of the translator's control and are under the
control of the display system of the bilingual operating
system. The modes can be set with an external escape code to
the printer or a sequence of key strokes to set the screen
to Arabic mode.

These limitations can be resolved once a standard is set. I
believe the bilingual operating system should by default
handle the justification issue, and allow the user to turn
this option off. This should come within two to five years
in the industry involved with Arabic text handling.


VI. PROGRAM MODEL


A. INTRODUCTION

The Lexical Translator program is intended to be simple,
flexible, and to demonstrate feasibility of the concept.
Speed and efficiency were not primary goals. Features can
be added as needed based on the response of users of the
program.

The program will require the supervision of a good
PASCAL programmer to assist the compilation and execution of
the translated code. The assistance could be achieved by
simple detailed instructions on how to use the program to
generate output code.


B. PROGRAM ENVIRONMENT

The Translator is developed under a certain environment, 
and until there is a unified standard for a bilingual 
operating system, program portability and compatibility will 
be limited. 

1. Hardware Environment 

The program is developed using an IBM XT personal computer.
It could just as well be developed using an IBM PC Jr. or
IBM AT. The IBM XT has 640 kilobytes of RAM, a 20 megabyte
hard disk, two half-height floppy disk drives, and the ALIS
Inc. graphics board. The board is made up of two boards back
to back. The first is a Paradise color graphics board. The
second sits on top of the Paradise board and carries the
Arabic character generator and the necessary connection
circuitry. The two boards fit in one slot on the motherboard
of the XT computer.

The keyboard is an IBM PC keyboard with cap stickers 
for the keys. Each sticker has two to four different 
characters, for Arabic and Latin. The keyboard layout is 
displayed in Appendix D. 

An Epson FX 85 dot matrix printer is used for the 
listing of the program. The printer has an Arabic driver to 
display Arabic characters. 

2. Software Environment 

The ALIS Inc. BCON bilingual operating system was used in
developing the thesis program and test runs. BCON resides in
low memory using about 20K bytes. BCON is supposed to be
transparent to DOS, the Disk Operating System used by IBM
microcomputers. The BCON operating system requires special
skill and more than average user knowledge. BCON is mainly
required for generating the Arabic fonts, and interpreting
and mapping the key strokes to their associated ARCII
values. The interpretation and mapping are performed under
the Arabic mode only. The Arabic characters are stored as
hex values ranging from 80 hex up to FF hex. This range of
values is reserved for graphics under the DOS operating
system. This means any Arabic character code is considered a
graphic character in the absence of BCON.

An important concept must be pointed out. BCON is present
to display the right form, font, and indentation of Arabic
text. So with minimum skill, a programmer can develop,
review, and correct Arabic characters on any DOS compatible
machine. Then the result can be displayed under BCON, where
BCON can interpret the graphics character as ARCII code, and
display the correct textual form of the ARCII code by
sending the appropriate display code to the terminal or the
printer.

When writing long Arabic texts, it is much easier to do so
under BCON, with the aid of an Arabic word processor. The
simple EDLIN editor available on the DOS distribution disk,
or the Turbo PASCAL editor of version 2.1 and below, will
also work. There are some limitations on what one can use
under BCON and still display Arabic characters. BCON
requires two conditions for compatibility when using any
application. First, BIOS interrupts² 16 Hex and 10 Hex are
called to access the keyboard and the screen respectively.
Second, the application must handle 8-bit characters [Ref.
2:p. 3-1].

Turbo PASCAL version 2.1 was used to write the main program
and resource file. The printer interface, called MPD by ALIS
[Ref. 2], is implemented for several printers. The name
stands for Multi Printer Driver. The MPD was used to drive
the Epson FX 85 to display the Arabic characters in the
program listings and sample tests (Appendices H, I).

²Information about the interrupts can be found in DOS
technical manuals for personal computers.


C. PROGRAM BODY 

The Lexical translator is designed to be easily modified,
and it should be modified when the updated version of BCON
utilizing the unified standard code set is available. The
program is modular and could be rewritten in "C" or FORTRAN.
The program is designed to generate a correct output file
from a correct input source file. The program will not
interpret the result, and the programmer must exercise
creativity and care as his/her programming advances, to
assure correct results and clear output.

The printable output of any developed program is either a
string of characters or mathematical results. Since string
assignments are not altered, string output presents no
difficulties. If the result is a real or integer number, the
result will be displayed based on the BCON digit mode. The
program did not concern itself with numerals since all the
users are familiar with the Western Hindu Numerals (Latin).
Also, BCON has an option that allows the user to swap the
digits in the operating system environment. So under BCON,
analyzing the results of numeric calculations would
duplicate work the operating system already does. This may
be a limitation under an operating system other than BCON.

1. Program Files


The program has two main files that are used for the
generation of the output code: the main file and the
resource file. The main file contains constant declarations,
data structure declarations, variable declarations,
procedures and functions, and the main program body.

The resource file has the assignments of a constant array
declared in the main program and is used as an include file.
The resource file has a subset of the reserved words and
standard function names. The resource file is a very useful
modular concept since you can replace the PASCAL resource
file with one for the language "C". With minimum changes in
the constants and directives one could use one Translator
with several resource files, one for each language, to
lexically translate from Arabic to the syntax of one of many
Latin compilers. This program focuses on the Turbo PASCAL
syntax.

2. Generated Files 

The translator will generate two files:

- A Dictionary file with the same name and "DIC" extension.

- An Output file with the same file name and "PAS" extension.
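
Deriving the two generated file names, and the numbered Latin
identifiers written into them, might be sketched as follows in
Python. The function names and the "ARB" source extension are
hypothetical. The "id_" prefix with a three-digit number is an
assumption about the identifier form, chosen to cover the one
thousand identifiers the program supports.

```python
import os


def generated_names(source_name):
    """Return (dictionary file, output file) names for a source file."""
    base, _ext = os.path.splitext(source_name)
    return base + ".DIC", base + ".PAS"


def latin_identifier(id_no):
    """Format an identifier number as a Latin identifier (assumed form)."""
    if not 0 <= id_no <= 999:
        raise ValueError("identifier number out of range 0..999")
    return "id_%03d" % id_no
```
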


The program will generate the desired output in the "PAS"
file. The dictionary file will be updated each time an
identifier is encountered for the first time. User-defined
Arabic identifiers are translated to identifiers of the form
"id_000 ... id_999."

3. Key Variables and Data Structure Declarations 

The external file "Resource.Pas" is an assignment of a
constant array. Each element of the array is a record. The
record has two components. The first component is the Latin
reserved word or function name, and the second component is
the Arabic translated (matching) word.³

The user-defined identifiers are handled by a hashing scheme
and a symbol table. The decision was to demonstrate an
efficient way to store and retrieve identifiers. The lexical
translator will be constantly looking up any non-reserved
identifier in a symbol table to insert it or to get its
Latin match if predefined. To improve efficiency, the
program uses a direct chain hashing scheme [Ref. 3:p. 45].

The identifier is passed to a function and given a key
number by Function_KEY. With a hashing formula the function
calculates the key number of the identifier. The key number
is a location in the hash table. The content of this
specific location is a pointer to a word_record which either
contains the word or is where a new record should be
inserted in case the word was not found. Words having the
same key number will be linked together in a linked list.
Having several words with the same key number decreases the
efficiency of hashing (see Ref. 3 on how to avoid hashing
collisions and when to use hashing).

³The translation is in no way standard or professionally
produced. The translation was made for demonstration
purposes.


The word record has the following:

Id_No    - the identifier number in the sequence of
           insertion.

Length   - number of characters of the identifier.

Lastchar - location of the last character in the symbol
           table.

Nextword - pointer to the next identifier with the same key
           number.

Latin_Id - the Latin identifier assigned to the identifier.

With the above word (identifier) information, we can locate
the word in the symbol table. The spelling table is
declared as an array of 5000 characters. The size is an
estimate and can be changed once a closer estimate can be
predicted. The symbol table is implemented as a linked list
and its size can vary dynamically so as to be as large as
necessary.
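
The relationship between a word record and the spelling table can
be sketched as below. This is an illustrative Python rendering
under stated assumptions, not the thesis's PASCAL declarations: it
assumes Lastchar indexes into the 5000-character spelling table,
and the class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

SPELLING_SIZE = 5000  # size quoted in the text


@dataclass
class WordRecord:
    id_no: int                                # sequence of insertion
    length: int                               # characters in the identifier
    last_char: int                            # index of its last character
    next_word: Optional["WordRecord"] = None  # same key number chain
    latin_id: str = ""                        # Latin identifier assigned


class SpellingTable:
    def __init__(self, size=SPELLING_SIZE):
        self.chars = [""] * size
        self.used = 0

    def store(self, word):
        """Append a spelling; return the index of its last character."""
        if self.used + len(word) > len(self.chars):
            raise OverflowError("spelling table is full")
        for ch in word:
            self.chars[self.used] = ch
            self.used += 1
        return self.used - 1

    def spelling(self, rec):
        """Recover a spelling from the record's Length and Lastchar."""
        start = rec.last_char - rec.length + 1
        return "".join(self.chars[start:rec.last_char + 1])
```

Storing only Length and Lastchar keeps each record a fixed size
while the spellings themselves share one character array.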

The translator looks for tokens using two methods. The
first method uses a pair of delimiters to identify the
token. The pair define the beginning and end of a token.
Token classes that can be identified by this method are
comments, literal strings, and directives.

The second method recognizes a token by its first character.
Examples of this class are integers and identifiers. The
second method includes tokens with one character such as
separators and terminators. Both separators and terminators
will be referred to as delimiters throughout the program.
The delimiters are defined in a constant set. The Hex values
of the set can be interpreted with the aid of the ARCII
table (Appendix D).


Errors are a user-defined data type. Types of errors are,
for example, long_token, long_comment, and
long_literal_string. All of the above errors are expected to
occur as a result of failure of the programmer to end a
comment or literal string.

The token types are defined to be one of the following:

- Blanks
- Reserved_word
- Identifier
- Literal_String
- Control_Code
- Comment
- Integer
- Functional_Operator
- Unclassified
- Illegal

These are the main declarations of the program. The
definition of the tokens and assignments of the variables
will be covered in the following sections.

4. Token Classes I and II





Class I tokens are recognized using the first method. This
includes the following types of tokens:

Literal_String: This token begins with an Arabic quote mark,
                single or double, and ends with it. The Hex
                values are 97 Hex and A2 Hex.

Comments:       Begin with a right bracket followed by an
                asterisk and end with an asterisk followed
                by a left bracket.

Directives:     Are strings in curly brackets. This feature
                is for debugging. The directives allow the
                user to choose between commented Latin
                source, with the original comments, and a
                debugging option that displays the tokens
                and their types on the monitor.

Class II covers the identifiers, including reserved words,
and integers. The remainder of the token types will be
reviewed shortly.

Identifiers and Reserved Words: Begin with an Arabic letter
                followed by an optional number of
                underscores, digits, or other Arabic
                characters.

Integers:       Begin with a digit and end with any
                non-digit character.

The remainder of the token types are Functional_Operator,
Illegal, and Unclassified. Functional_Operator tokens are
the arithmetic operators, brackets, asterisk, decimal digit,
semicolon, colon, pointer '^', etc. The Illegal token is
the token that exceeds its defined length. This condition
is used to set an error message to pass to the user about
the location of an error. An Illegal token is also set if
the Hex code is less than 80 Hex. The legal range is 80 ...
FF Hex. The control code is any escape code or function
call within the range of Arabic characters, 80 Hex ... FF
Hex. The Unclassified token type is used as the value
before the type is determined.
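
The two conditions for an Illegal token can be expressed as a short
check. The sketch below is in Python and works on raw byte codes.
The function name is_illegal and the MAX_TOKEN_LENGTH value are
hypothetical, since the text does not state the defined token
length.

```python
LOW, HIGH = 0x80, 0xFF      # legal code range given in the text
MAX_TOKEN_LENGTH = 32       # assumption: the real limit is not stated


def is_illegal(token_codes, max_length=MAX_TOKEN_LENGTH):
    """True if the token is too long or holds a code outside 80..FF Hex."""
    if len(token_codes) > max_length:
        return True
    return any(code < LOW or code > HIGH for code in token_codes)
```

For example, the Latin letter "A" (41 Hex) falls below 80 Hex and
is therefore flagged.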


D. PROGRAM MODULES

The Lexical translator will call several procedures and
functions in the process to generate the desired code. The
main body of the program calls several procedures and
functions. The program modules and their locally declared
procedures and functions are as follows:

Open_File
Initialize
Fill_Buffer
Token_and_Type
    Blank
    Comment
    Literal_String
    Integer_Token
    Identifier_Token
    Reserved_Token
    Special_Char_Token
    Control_Char_Token
Map_Identifier_To_Latin
Search
    Hash_Key
    Insert: calls Id_No
    Found
Latin_Integer
Get_Latin_Spec_Char
Print_Error_Messages

1. Open_File

The program starts by calling the Open_File procedure. The
procedure will prompt the user for the name of a file to
translate and verify that the file exists. The second part
is to open the input file for reading, and the Output file
and the Dictionary file for writing.

2. Initialize 

The Initialize procedure will set all the hash table
pointers to nil. The nil values indicate that no words with
that key number have been entered yet. It will also set the
initial values of global variables. The module is called
once at the beginning of the program.

3. Fill_Buffer

This procedure will get a line of source code, keep 
track of the line number of the source code, and set the 
line size of the source code. This module is continuously 
called by the main program until the end of the source file 
is reached. 
4. Buffer_Empty 

This function tests whether the variable Next_Loc, which
represents the next token location on the line, is pointing
beyond the Line_Size variable. In that case the function
returns true, causing the main program to call the
Fill_Buffer procedure to refill the buffer. This module is
called continuously by the main program.
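
The interplay of Fill_Buffer and Buffer_Empty can be sketched as a
small line buffer. This Python version is an illustration only: the
LineBuffer class is hypothetical, and it uses zero-based indexing,
so "beyond Line_Size" becomes next_loc >= line_size.

```python
import io


class LineBuffer:
    def __init__(self, source):
        self.source = source   # any text file object
        self.line = ""
        self.line_no = 0       # line number in the source code
        self.line_size = 0
        self.next_loc = 0      # next token location on the line

    def fill_buffer(self):
        """Read one line; return False at the end of the source file."""
        text = self.source.readline()
        if text == "":
            return False
        self.line = text.rstrip("\r\n")
        self.line_no += 1
        self.line_size = len(self.line)
        self.next_loc = 0
        return True

    def buffer_empty(self):
        """True when next_loc has moved past the last character."""
        return self.next_loc >= self.line_size
```

A main loop would call fill_buffer whenever buffer_empty returns
true, and stop when fill_buffer returns false.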

5. Token_And_Type

When called, this procedure is passed a line of source code
and the location of the first character of the token to be
fetched. The procedure gets the token and gives it a type.
The procedure initially sets the type of the token to
Unclassified and, through several calls, tries to analyze
the type of the token. The first convenient check is for
Comments. It should be noted here that one would like to
place the most likely type check at the beginning to reduce
the time of analysis of the token type. Another reason for
searching for comments first is that they are the only type
that requires two characters at the beginning and the end of
the token. The rest can be predicted just by inspecting the
first character.

If the token type is not set to Comment, then the module
calls several modules with a case statement. The modules are
called based on the first character after the last token
read. The Next Location variable points at this character in
the input line buffer called "Line." The possibilities are:

FIRST CHARACTER            LIKELY TOKEN TYPE
Arabic space               Blank(s)
Double or Single Quotes    Literal string
Arabic Digit               Integer
Arabic Letter              Identifier
Function Code              Control Char
Other Characters           Special Characters

Each possible token type above represents a module. The
module will be called to set the type of the token.
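
The dispatch in the table above, with the comment check placed
first, can be sketched as follows. This is a Python illustration
under several assumptions: Unicode Arabic letters and digits stand
in for the ARCII codes, "(*" stands in for the Arabic comment
opening, the escape character stands in for a BCON function code,
and the function name likely_token_type is hypothetical.

```python
ARABIC_DIGITS = {chr(c) for c in range(0x0660, 0x066A)}
FUNCTION_CODES = {"\x1b"}   # assumption: one stand-in function code
COMMENT_OPEN = "(*"         # assumption: stand-in for the Arabic pair


def is_arabic_letter(ch):
    # Approximation: the basic Arabic letter range in Unicode.
    return "\u0621" <= ch <= "\u064a"


def likely_token_type(line, next_loc):
    """Predict the token type from the character(s) at next_loc."""
    # Comments are checked first: they need two opening characters.
    if line.startswith(COMMENT_OPEN, next_loc):
        return "Comment"
    ch = line[next_loc]
    if ch == " ":
        return "Blanks"
    if ch in ("'", '"'):
        return "Literal_String"
    if ch in ARABIC_DIGITS:
        return "Integer"
    if is_arabic_letter(ch):
        return "Identifier"
    if ch in FUNCTION_CODES:
        return "Control_Char"
    return "Special_Char"
```
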

Looking at each module called by Token_and_Type, they all
set the token type and the length of the token. All likely
token types except for Literal Strings and Comments will not
set any error flags, since one character will satisfy their
requirements. For example, Blanks, Integers, Identifiers,
Control Characters, and Special Characters all could be one
character long. When the Literal String and Comment modules
are called, the token must begin and end with a
predetermined pattern. So a comment left open beyond the end
of the line is an error, and the same holds for a long
literal string token. Token_And_Type only examines the Line
Buffer character and does not consume it. The called modules
assign the character to the Token Buffer and advance the
pointer of the Line Buffer one character. When a token type
is successfully assigned, the module sets the token length.
PASCAL uses the first array location to store the length of
the assigned characters in bytes.

The behavior of the modules called by Token_and_Type is
summarized below:


Blanks:         Will keep consuming the Line Buffer blanks
                (Arabic and Latin) until a non-blank
                character is reached. Blanks will set Token
                Type and Length.

Comment:        Consumes the characters within the Arabic
                character range until the comment closing
                mark is reached. The module will set the
                error set to long_comment if any character
                lies in the Latin alphabet range, including
                the end of file and carriage return (ASCII
                0D, 0A Hex). The error is long_comment since
                the comment is restricted to one line.
                Comment alters the opening and closing
                brackets of the Arabic comment token. The
                characters are the Arabic opening bracket,
                closing bracket, and the asterisk, having
                the Hex values A8, A9, and AA respectively.

Literal_String: The module will be called in case the next
                character is a single or double quote. The
                module will expect the token to be
                terminated with the same character it began
                with. If the matching character is not
                reached before the end of the line, the
                token is considered illegal, and the error
                set will be assigned the type long_token.
                Valid literal strings will not be altered.
                However, the opening and closing marks will
                be translated to single or double quotes
                accordingly.

Integer_Tok:    Stands for integer token, and will be called
                when a digit is present. The module will
                keep assigning the Latin digits to the token
                buffer, and assign the Token_Type Integer to
                the variable Tok_type.

Identifier_Tok: Will be called when the character is a
                letter. A single letter qualifies as an
                identifier alone, or could be followed by an
                optional number of Arabic underscores,
                digits, or letters. The module will set the
                Tok_type to Identifier. The module has no
                effect on the error set, since when called
                the token was already valid based on its
                first character.

Reserved_Tok:   The module is called when the token found is
                an identifier. The module will check if the
                token is in the reserved words constant
                array called "Res_Word." If the identifier
                is a reserved word, the index of the table
                is passed back to the main program.

Control_Char_Tok: The module is called when a BCON function
                code is the next character in the
                Line_Buffer. The module assigns one
                character (code) to the token buffer.

Special_Char:   This module assigns one character to the
                token. The token will always have one
                character.
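
As an illustration of the Literal_String behavior summarized above,
the following Python sketch scans a quoted token on one line and
reports long_token when the closing quote is missing. The function
name scan_literal and its return shape are hypothetical, and Latin
quote characters are used in place of the Arabic quote codes.

```python
def scan_literal(line, start):
    """Scan a literal string that opens at line[start].

    Returns (token, next_loc, error). The token keeps its contents
    unaltered; error is "long_token" when the quote that opened the
    string is not found again before the end of the line.
    """
    quote = line[start]                  # single or double quote
    end = line.find(quote, start + 1)    # must close with the same mark
    if end == -1:
        return None, len(line), "long_token"
    return line[start:end + 1], end + 1, None
```

Because the closing mark must match the opening one, a double quote
inside a single-quoted string is treated as ordinary content.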

When Token_and_Type returns the token type to the main
program, a case statement will either call a procedure or do
the processing with a compound statement. The blanks will be
translated to Latin ASCII code blanks. The returned comment
token will be written out as is. Literal strings are written
out literally. Reserved words are written out using the
Match_Index in the Res_Word constant array. The identifiers
are looked up in the symbol table. If an identifier is
predefined, its identifier number is returned with it;
otherwise the identifier is inserted in the table and given
an identifier number. The module used is called
Map_Iden_To_Latin.


6. Map_Iden_To_Latin 





The identifier token is received and searched for with a
procedure called Search.

7. Search

This module starts by calling the Hash_Key function.

a. Hash_Key

Hash_Key calculates the token key_no with a hash formula.
The key number is used to look up the pointer of the word
record in the hash table. The word record is the head of a
linked list of identifiers with the same key number. All the
pointers are initialized to nil at the beginning of the
program. If the key number results in a nil pointer value,
there is no such word in the symbol table, nor any other
word with the same hash key number, and Search calls Insert
to insert the identifier in the symbol table.

b. Insert

Insert creates a word record at the beginning of the linked
list and stores the identifier in the spelling table. Insert
makes a call to IDEN_LBL_NO, which uses the global variable
ID_NO (sequence of appearance), and assigns an identifier
number in the word record.

If the pointer is pointing at a word record, then the first
word in the linked list is checked, and so on until there
are no more word records in the list or the word is found.

c. Found

The function Found checks if the resulting pointer is
pointing at the exact identifier spelling.

If the word record is found then it has already been
assigned a specific identifier number, which is then passed
back to the main program to be written out as the Latin
identifier.
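
The Search, Hash_Key, Insert, and Found steps can be sketched
together as a chained hash table. This Python version is a
simplified illustration: the table size, the character-sum hash
formula, and the SymbolTable class are assumptions, and each node
stores its spelling directly instead of using a spelling table.

```python
TABLE_SIZE = 101  # assumption: the thesis does not state the size


def hash_key(word):
    """Stand-in hash formula: sum of character codes modulo table size."""
    return sum(ord(ch) for ch in word) % TABLE_SIZE


class SymbolTable:
    def __init__(self):
        self.heads = [None] * TABLE_SIZE  # all pointers start as nil
        self.next_id = 0                  # sequence of appearance

    def search(self, word):
        """Return the identifier number, inserting the word if new."""
        key = hash_key(word)
        node = self.heads[key]
        while node is not None:
            if node["word"] == word:      # the Found check
                return node["id_no"]
            node = node["next"]
        return self.insert(key, word)

    def insert(self, key, word):
        """Create a record at the beginning of the chain for this key."""
        node = {"word": word, "id_no": self.next_id,
                "next": self.heads[key]}
        self.heads[key] = node
        self.next_id += 1
        return node["id_no"]
```

With this hash formula, words made of the same letters in a
different order share a key number, which exercises the chain.
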
8. Latin_Int

The procedure maps each digit of the token to the Latin
digit 0...9, and passes back the Latin integer.

9. Get_Latin_Spec_Char

The procedure gives each Arabic special character its Latin
"functionally" equivalent character.

10. Print_Errors

Based on the error set, Print_Errors will send the error
type and the line number in the source code where it was
encountered.


E. PROGRAM DIRECTIVES 

The program offers two directives. The first is the option
to keep the source comments in the output file; by default
the program omits the comments. The second is the option to
turn the debug option on and off at any location in the
code, at the beginning of a line. This option will display
the tokens and their types as they are scanned.

The program is demonstrated by a list of test runs to verify
the translation of reserved words and special characters.
Also, a sample of small PASCAL programs is included with
their generated files, code and dictionary tables (Appendix
I).


F. LIMITATIONS

The program does not allow the user to use the 'Include'
directive in TURBO PASCAL. TURBO PASCAL limits the size of a
program to 64k; additional code would normally be brought in
as an 'Include' file.

The program is set to handle up to one thousand 
identifiers. This is a reasonable number in working with 
TURBO PASCAL since the program size is limited to 64k bytes. 

The spelling table is 5000 characters long. That means the
total length of all identifiers cannot exceed 5000
characters. When writing long programs, the programmer can
avoid exceeding the limit by using short identifiers.

The program will not generate an error flag if a Latin 
string is found in comments or literal string. This is 
because both comments and literal strings are not altered. 

ARCII provides two commas. The numeric comma is used with
real numbers in Arabic, and the Arabic comma is used, in
this specification, as the Latin comma except for the real
number case. This is a small hurdle when translating the
generated code back to Arabic. The appearance of the two
Arabic commas is different. They are 180° out of phase on
the vertical axis, and the numeric comma looks like the
Latin comma. Both commas were used to avoid overloading the
use of the Arabic comma.


VII. CONCLUSION 


This thesis has tried to narrow the gap between educated
Arabic-speaking people and computers in general. The target
ages are mid-teenage, and forty-five and above. The
majority of these two classes still look at computers as
magic. They believe man created them. However, they have a
hard time believing that man tells computers what to do.
With that attitude, the only thing that can convince them is
to help them write small programs and see the results.
We are convinced that the majority will get rid of their
fear and have the desire to explore this machine.

In short, the topic of the interface between the rich
Latin software library and the Arabic language environment
is a promising area in the sense that it will bring those
who use computers closer, and find a more efficient way to
get the job or hobby done.


A. CONCEPT FUTURE

The program is simple in concept and to code, but the
environment where it is expected to work is not yet
standardized. The standards are not widely implemented, nor
are the developers of bilingual operating systems very
helpful in responding to concerns about hardware
compatibility.

Once a unified environment is established, the concept could
be developed further. The goal of this work was to
illustrate the feasibility and avoid specific issues of the
implementation environment. The program modules were
designed to be adaptable and portable for several purposes
with little modification. For example:

- For several programming language translations, such as
  "C," FORTRAN, and BASIC, we only need several resource
  files and several special character sets, one for each
  programming language requirement.

- For several code sets, including different languages, we
  need the concept of a bilingual operating system that
  uses the upper range of the character set, ranging from
  80 ... FF hex.

- The program can work in a Latin-only operating system,
  to translate source codes that have been edited using
  Arabic code set values. Also, the generated source
  could be compiled on the same machine. If the program
  is interactive, then it needs to run under a bilingual
  operating system.

B. LIMITATIONS 

The bilingual operating system was not well documented 
as far as how some of the function codes are implemented 
during editing. Some of the characters have two codes (such 
as the Arabic multiply sign and the numeric multiply sign). 
To know which multiply sign is generated when I strike a 
key, I had to use an editing tool to display the code in Hex 
values and match the text file and its Hex values. 

Right indentation is relative to the editor mode. If you
select your screen mode to be Arabic and you read a piece of
Latin code, it will be right justified.

The user must be careful reading data files. Some data is
readable only in Arabic mode and some data is readable only
in Latin. Also, the data displayed may have been transformed
by the operating system. As mentioned before, the user could
use the "SWAP" option for altering ASCII digits and ARCII
digits in the DOS environment, or read the digits as a
string and change the values into ASCII. This is important
in order to perform numerical operations with Arabic digits.

I strongly believe that, with time, standards will be 
developed with more care and concern for the user. This is 
the reason we chose not to design the program for a specific 
system. 

It is hoped that this work will benefit other researchers
and future thesis students from other countries since a
similar concept could be applied to other languages,
especially languages descended from Latin.


APPENDIX A


FIGURES 


Figure l. The 28 Arabic Alphabets 


Figure 2. The 31 Alphabets (Optimum Set) 
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NAME CHARACTER 


A 
ae 
D 
D 
PELL LENA me OL Ll — 


SEEN “ 


HAMMAZAH 
TAAMARBOTA 
ALEF_MAQSURA 


G orn 


Figure 3. Arabic Alphabet Names 


qe 


CHARACTER 


Ge oe¢rwCcabkamrnmn &oC 


Figure 4. Arabic Diacrities 4 Vowelwzaciven) 
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leew Ye (Glee c Vue] 


Eastern Hindu numerals 


a - e  ee e Bs 1 


Western Hindu numerals 


Pigumae ss. 8hindu Numerals 


[Sample Arabic text; not recoverable.]

a. Without Vowels

[Sample Arabic text; not recoverable.]

b. With Vowels


Figure 6. Arabic Text

[Diagram labels: Keyboard, key codes, BCON reduced codes, Operating system, applications.]


Figure 7. BCON Code Sets


APPENDIX B


TEXAS INSTRUMENTS APPROACH TO 
BILINGUAL OPERATING SYSTEM 


Philosophy of Bilingual Arabic 
Latin Implementation on Microcomputer 
System 


Texas Instruments 

ARABIC COMPUTER SYSTEMS
PHILOSOPHY


SPECIFIC CHARACTERISTICS OF THE ARABIC LANGUAGE 


* ARABIC IS WRITTEN FROM RIGHT TO LEFT 


* THERE ARE SOME VARIATIONS IN TYPES OF ARABIC CURRENTLY IN USE IN 
DIFFERENT COUNTRIES 


* THE LANGUAGE IS A FOUR LEVEL ONE. A CHARACTER CAN HAVE UP TO 
FOUR SHAPES DEPENDING ON ITS POSITION IN THE WORD : ISOLATED, 
INITIAL, MEDIAL OR FINAL 

* ARABIC CHARACTERS ARE JOINED WITHIN A WORD 


* NO UPPER CASE EXISTS IN ARABIC 


I will start my presentation by briefly mentioning some of the characteristics of the
Arabic language which have been covered in previous papers and which affect the use of
the Arabic language in the computer field.

ARABIC COMPUTER SYSTEMS
PHILOSOPHY 


SPECIFIC CHARACTERISTICS OF THE ARABIC LANGUAGE 


ONLY THREE CHARACTER VOWELS EXIST IN ARABIC :
ALIF ا , OUAOU و , YAA ي


VOWELISATION IN ARABIC IS ALSO PERFORMED THROUGH THE USE OF 
DIACRITICS. THESE ARE USED : 

— IN THE CASE OF SIMILARLY WRITTEN WORDS TO AID THE READER 
- IN RELIGIOUS TEXTS INCLUDING THE KORAN 

- FOR SCHOOL TEACHING 


ARABIC LANGUAGE USES INDIAN NUMERICS, WITH THE DECIMAL POINT 
BEING A COMMA. 


THERE ARE ARABIC SPECIAL CHARACTERS WHICH INCLUDE THE ARABIC
COMMA ، , SEMICOLON ؛ , QUESTION MARK ؟ , ETC.


ARABIC COMPUTER SYSTEMS
PHILOSOPHY


ARABIC ALPHABET 


THE BASIC ARABIC ALPHABET IS COMPOSED OF 28 CHARACTERS 


THE LAMALIF WHICH IS COMPOSED OF TWO CHARACTERS LAM + ALIF 
IS CONSIDERED AS ONE CHARACTER 


THE HAMZA CAN BE WRITTEN IN MANY DIFFERENT WAYS IN ARABIC 
DEPENDING ON ITS USE, WITH A VOWEL OR ISOLATED 


IF THESE TWO CHARACTERS ARE TAKEN INTO CONSIDERATION THE 
ALPHABET IS 30 CHARACTERS 


THE TAMARBOUTA IS A SPECIAL CHARACTER NOT INCLUDED IN THE
ALPHABET. IT IS OCCASIONALLY INCLUDED AT THE END OF WORDS
DEPENDING ON GRAMMATICAL RULES

ARABIC COMPUTER SYSTEMS
BILINGUAL SYSTEM APPROACH


SOLUTION 1 : CORRESPONDENCE & DIFFERENCES


* THIS STUDY IS BASED ON THE CORRESPONDENCE AND DIFFERENCES
BETWEEN ARABIC CHARACTERS. THE ARABIC ALPHABET MAY BE CONSIDERED
AS FORMED OF THREE TYPES OF CHARACTERS :


- TYPE A INCLUDES CHARACTERS HAVING 1, 2, OR 3 POINTS

- TYPE B INCLUDES CHARACTERS WITHOUT POINTS

- TYPE C INCLUDES CHARACTERS HAVING AT LEAST ONE FORM IN EACH CASE


* IF WE ONLY CONSIDER THE FORMS WITHOUT POINTS WE CAN REDUCE THE 
CHARACTERS IN EACH TYPE AND THEN ADD THE POINTS AFTERWARDS 

ARABIC COMPUTER SYSTEMS
BILINGUAL SYSTEM APPROACH


SOLUTION 2 : ROOTS & APPENDICES


* A STUDY BASED ON THE USE OF APPENDICES AND ROOTS TENDS TO
REDUCE THE TOTAL NUMBER OF SHAPES BY CONSIDERING A ROOT TO BE
USED IN INITIAL & MEDIAL SHAPES TO WHICH AN APPENDIX IS ADDED TO
FORM THE FINAL OR ISOLATED SHAPES

TYPE A    TYPE B    TYPE C

[Examples of root + appendix compositions for each type; glyphs not recoverable.]


The problem with this solution is what code to give to these appendices. If they are coded, would
they be considered as characters in a character count? How would high level languages interpret them?
How would special software functions (replace, insert, find string) interpret them?

This is the study which resulted in the actual Arabic implementation on Texas Instruments equipment
and which will be explained in this paper.

ARABIC COMPUTER SYSTEMS
BILINGUAL SYSTEM APPROACH


SOLUTION 3 : CONTEXTUAL ANALYSIS


A STUDY BASED ON THE USE OF SHAPING ALGORITHMS. USING CONTEXTUAL
ANALYSIS TO DETERMINE THE PROPER SHAPE OF THE CHARACTER, FOUR
GROUPS ARE IDENTIFIED


- GROUP 1 ONE SHAPE PER CHARACTER
- GROUP 2 TWO SHAPES PER CHARACTER
- GROUP 3 THREE SHAPES PER CHARACTER
- GROUP 4 FOUR SHAPES PER CHARACTER


POSSIBLE APPROACHES


- ONE-KEY ONE-SHAPE SIMPLIFIES THE SOFTWARE BUT USUALLY LIMITS THE
SET OF ARABIC CHARACTERS AND CREATES A COMPLEX KEYBOARD SINCE
ALL THE ARABIC CHARACTER SHAPES MUST BE PRESENT ON THE
KEYBOARD.


- ONE-KEY MANY-SHAPES IMPLIES MORE SOPHISTICATED SOFTWARE BUT
SIMPLIFIES KEYBOARD & USER INTERFACE


Of these two approaches the second has been chosen, and it will be covered in the following slides.


APPENDIX C


DS9900 BILINGUAL COMPUTER 
SYSTEM BY TEXAS INSTRUMENTS 


ARABIC COMPUTER SYSTEMS 
DS990 BILINGUAL SYSTEM


COMMERCIAL COMPUTING REQUIREMENTS FOR THE MIDDLE-EAST 


BILINGUAL LATIN/ARABIC DATA INPUT & OUTPUT 
COBOL DRIVEN APPLICATIONS 

BILINGUAL PRINTING 

BILINGUAL SORT/MERGE 


SPECIAL PRODUCTS DEVELOPED TO MEET REQUIREMENTS


BILINGUAL DATA ENTRY TERMINAL 
BILINGUAL MATRIX PRINTER 
BILINGUAL LINE PRINTER 


SOFTWARE


These handle both languages in the natural manner, with software that simplifies the keyboard for
operators and easy handling by high-level languages.


ARABIC COMPUTER SYSTEMS 
DS990 BILINGUAL SYSTEM


CHARACTERISTICS OF BILINGUAL DATA ENTRY TERMINAL 


* BILINGUAL VIDEO DISPLAY UNIT 


- THE CHARACTER GENERATOR ROM GENERATES 7 x 8 MATRIX FOR ALL
STANDARD ASCII CHARACTERS AND 128 ARABIC SHAPES


A 7 x 10 MATRIX IS USED FOR INTRICATE ARABIC CHARACTERS 


Latin & Arabic can be displayed on the screen at the same time. 

ARABIC COMPUTER SYSTEMS
DS990 BILINGUAL SYSTEM


* BILINGUAL KEYBOARD


* Basic placement of the Arabic keys is like that of a typewriter.


PROVIDES 5 MODES OF OPERATION : ARABIC, LATIN, SHIFT, UPPERCASE
& CONTROL. IT CONSISTS OF 91 KEYS


PROVIDES THE USER WITH THE CAPABILITY OF ENTERING ARABIC 
AND/OR LATIN DATA WITHOUT CONSTRAINTS 


KEYBOARD MULTIFUNCTION CAPABILITY IS PROVIDED BY A MODE 
SELECTION KEY AND TWO CHARACTER SET SELECTION KEYS 


DATA IN EITHER LANGUAGE CAN BE ENTERED IN EITHER MODE 


THE KEYBOARD GENERATES 7-BIT CODES FOR LATIN AND 8-BIT CODES 
FOR ARABIC 
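
The split between 7-bit Latin codes and 8-bit Arabic codes suggests a simple way for software to separate the two languages in a mixed data stream. The following is a minimal Python sketch, assuming that every Arabic code has its high-order bit set; the slides do not state this explicitly, and the function names and byte values are illustrative only.

```python
from itertools import groupby


def script_of(code: int) -> str:
    """Classify one keyboard code by its high-order bit (an assumption)."""
    if not 0 <= code <= 0xFF:
        raise ValueError(f"not an 8-bit code: {code!r}")
    return "arabic" if code & 0x80 else "latin"


def split_runs(codes: bytes) -> list[tuple[str, bytes]]:
    """Group a mixed stream into consecutive Latin and Arabic runs."""
    return [(script, bytes(group)) for script, group in groupby(codes, key=script_of)]


if __name__ == "__main__":
    # 0xC8 and 0xC9 are hypothetical Arabic codes.
    print(split_runs(bytes([0x41, 0x42, 0xC8, 0xC9, 0x20])))
```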


[Keyboard layout figure; not recoverable.]


ARABIC COMPUTER SYSTEMS
DS990 BILINGUAL SYSTEM
ARABIC CHARACTER SHAPING 


* 32 BASIC ARABIC CHARACTERS ARE GENERATED BY THE KEYBOARD 

* A CONTEXTUAL ANALYSIS OF THE ARABIC DATA IS PERFORMED BY THE 
CONTROL PROGRAM TO DETERMINE THE CORRECT SHAPE OF THE 
CHARACTER TO BE DISPLAYED 

* IN TOTAL THE TERMINAL CAN DISPLAY 115 SHAPES


EXAMPLE OF SHAPING PROCEDURE


ENTER YAA   : ي
      TA    : يت
      KAF   : يتك
      LAM   : يتكل
      MIM   : يتكلم
      SPACE : يتكلم
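
The contextual analysis behind this example can be sketched as follows. This is not the Texas Instruments control program; it is a simplified Python illustration that uses Unicode code points in place of the terminal's codes, handles unvowelized text only, and reports the name of the form rather than a display shape. The joining classes follow the usual Arabic rule that a few letters never connect to the letter after them.

```python
# Letters that join only to the preceding letter (never to the next one).
RIGHT_JOINING_ONLY = set(
    "\u0622\u0623\u0624\u0625\u0627\u0629\u062F\u0630\u0631\u0632\u0648"
)
NON_JOINING = {"\u0621"}  # isolated HAMZA joins on neither side


def is_arabic_letter(ch: str) -> bool:
    return "\u0621" <= ch <= "\u064A"


def joins_to_next(ch: str) -> bool:
    return is_arabic_letter(ch) and ch not in RIGHT_JOINING_ONLY and ch not in NON_JOINING


def joins_to_prev(ch: str) -> bool:
    return is_arabic_letter(ch) and ch not in NON_JOINING


def shape_forms(word: str) -> list[str]:
    """Return isolated/initial/medial/final for each letter of `word`."""
    forms = []
    for i, ch in enumerate(word):
        prev_ok = i > 0 and joins_to_next(word[i - 1]) and joins_to_prev(ch)
        next_ok = i + 1 < len(word) and joins_to_next(ch) and joins_to_prev(word[i + 1])
        if prev_ok and next_ok:
            forms.append("medial")
        elif prev_ok:
            forms.append("final")
        elif next_ok:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


if __name__ == "__main__":
    typed = ""
    for key in "\u064A\u062A\u0643\u0644\u0645":  # YAA, TA, KAF, LAM, MIM
        typed += key
        print(shape_forms(typed))  # shapes are re-evaluated after every key
```

Because the form of a letter depends on its neighbour, the last letter typed changes from final to medial as soon as another joining letter follows it, which is why the whole word is re-analysed at each keystroke.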

ARABIC COMPUTER SYSTEMS
DS990 BILINGUAL SYSTEM


DEVICE SERVICE ROUTINE INTERFACE OVERVIEW


* THE DEVICE SERVICE ROUTINE IS CONTROL SOFTWARE BETWEEN THE
USER'S PROGRAM AND THE VIDEO DISPLAY TERMINAL (VDT)


[Diagram labels: DISPLAY ROM PROGRAM, VDT OUTPUT INTERFACE, ARABIC DSR BUFFER, VDT CONTROLLER; a boundary separates software from hardware.]


System flexibility through software implementation.

ARABIC COMPUTER SYSTEMS
DS990 BILINGUAL SYSTEM 


BILINGUAL TERMINAL PROGRAM INTERFACE 





* Basic Character Set. 




ARABIC COMPUTER SYSTEMS
DS990 BILINGUAL SYSTEM


BILINGUAL TERMINAL DISPLAY ROM INTERFACE 





Problem of Arabic numerics: the ASCII numeric codes must be used for COBOL and FORTRAN.

APPENDIX D


BCON BILINGUAL OPERATING SYSTEM BY ALIS INC.


Default Reduced Codes


Latin characters identical to original ASCII set with the exception of the
following two characters:

Function code Set Bilingual Screen Operating Mode (imaged as Latin space)
Function code Set Latin-Only Screen Operating Mode (imaged as Latin space)


Numeric* space
Arabic** number sign
Numeric multiply sign
Arabic ampersand sign
Arabic apostrophe sign
Numeric percent sign
Numeric divide sign
Numeric left parenthesis
Numeric right parenthesis
Numeric plus sign
Numeric minus sign
Numeric less than sign
Numeric equals sign
Numeric greater than sign
Function code Set Arabic Screen Language Mode (imaged as Arabic space)
Function code Set Latin Screen Language Mode (imaged as Arabic space)
Function code Set Arabic Line Language Mode (imaged as Arabic space)
Function code Set Latin Line Language Mode (imaged as Arabic space)
Arabic commercial at sign
Arabic left square bracket
Arabic right square bracket
Arabic upward arrow head
Arabic underline
Arabic reverse apostrophe
Arabic left curly bracket
Arabic vertical line
Arabic right curly bracket
Arabic tilde
(reserved)
(reserved)
Function code Set Line Boundary (imaged as Arabic space)
(reserved)


(*): Numeric means character is Arabic but has intrinsic right
spacing (i.e. will be considered part of a numeric string).
(**): Arabic means character has intrinsic left spacing.

Arabic space
Arabic exclamation mark
Arabic quotation mark
Arabic multiply sign
Arabic dollar sign
Arabic percent sign
Arabic period
Arabic divide sign
Arabic left parenthesis
Arabic right parenthesis
Arabic asterisk
Arabic plus sign
Arabic comma
Arabic minus sign
Numeric comma
Arabic solidus
Arabic digit
Arabic digit
Arabic digit
Arabic digit
Arabic digit
Arabic digit
Arabic digit
Arabic digit
Arabic digit
Arabic digit
Arabic colon
Arabic semi-colon
Arabic greater than sign
Arabic equals sign
Arabic less than sign
Arabic question mark
TAIL
KASHIDA
SHADDAH
SUKUN
FAT'HA
SHADDAH FAT'HA
FAT'HATAN
SHADDAH FAT'HATAN
DAMMAH
SHADDAH DAMMAH
DAMMATAN
SHADDAH DAMMATAN
KASRAH
SHADDAH KASRAH
KASRATAN
SHADDAH KASRATAN
HAMZAH
ALEF
WASLA ON ALEF
HAMZAH ON ALEF
HAMZAH UNDER ALEF
MADDAH ON ALEF
BA'A
PEH
TAA MARBUTA
TAA
THA'A
JEEM
SHEEM
HA'A
KHA'A

[The entries between KHA'A and LAM are not recoverable.]

LAM
LAMALEF
WASLA ON LAMALEF
HAMZAH ON LAMALEF
HAMZAH UNDER LAMALEF
MADDAH ON LAMALEF
MEEM
NOON
HA
WAW
HAMZAH ON WAW
ALEF MAQSURA
YA'A
HAMZAH ON YA'A
Arabic reverse solidus
Blank "FF" character (imaged as Arabic space)

Key Codes to Reduced Codes Table


Key code* (ASCII) | English Legend | Reduced Code (ARCII) | Arabic Legend | Arabic Name


Arabic space
Arabic !
Arabic "
Arabic #
Arabic $
Arabic %
Arabic &
TAH
Arabic (
Arabic )
Arabic *
Arabic +
WAW
Arabic -
ZAIN
DHAH




Arabic 0 
Arabic 1 
Arabic 2 
Arabic 3 
Arabic 4 
Arabic 5 
Arabic 6 
Arabic 7 
Arabic 8 
Arabic 9 
Arabic : 
KAF 
Arabic numeric comma
Arabic = 
Arabic 
Arabic ” 


(*): Character byte of key code word only (low-order byte). The
scan code (high-order byte) is not modified by BCON.
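
The footnote describes a 16-bit key code word whose low-order byte carries the character and whose high-order byte carries the scan code. A minimal Python sketch of that split follows; the function names and the table entry are assumptions for illustration, not BCON's actual interface or code values.

```python
def split_key_code(word: int) -> tuple[int, int]:
    """Split a 16-bit key code word into (scan code, character byte)."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError(f"not a 16-bit word: {word!r}")
    return (word >> 8) & 0xFF, word & 0xFF


def remap_char_byte(word: int, table: dict[int, int]) -> int:
    """Translate only the character byte; the scan code passes through unchanged."""
    scan, char = split_key_code(word)
    return (scan << 8) | table.get(char, char)


if __name__ == "__main__":
    # Hypothetical mapping: ASCII apostrophe (0x27) to a made-up reduced code.
    key_to_reduced = {0x27: 0xD8}
    print(hex(remap_char_byte(0x2827, key_to_reduced)))
```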


Arabic @ 

KASRAH 

MADDAH ON LAMALEF 
Arabic 

Arabic [

DAMMAH 

Arabic | 

HAMZAH ON LAMALEF 
HAMZAH ON ALEF 
Arabic divide sign 
KASHIDA 

Arabic comma 

Arabic / 

Arabic * 

MADDAH ON ALEF 
Arabic multiply sign 


Arabic semi-colon 
FAT'HA 

DAMMATAN 
KASRATAN 

HAMZAH UNDER LAMALEF 
Arabic 

Arabic ; 

FAT'HATAN 

SUKUN

HAMZAH UNDER ALEF 
TAIL 

JEEM 

Arabic \ 

DAL 

Arabic — 

Arabic _ 





THAL 

SHEEN 
LAMALEF 
HAMZAH ON WAW 
YAA 

THA'A

BA'A

LAM 

ALEF 

HA 

TA'A

NOON 

MEEM 

TA'A MARBUTA
ALEF MAQSURA 
KHA‘A 

HA‘A 

DAD 

QAF 

SEEN 

FA 

AIN 

RA 

SAD 

HAMZAH 
GHAIN 
HAMZAH ON YA’A 
Arabic > 

Arabic | 

Arabic < 
SHADDAH 


Key code (scan - char) | English Legend | Reduced Code (ARCII) | Arabic Legend | Arabic Name


1800    PEH
1800    PEH
1900    SHEEM
1900    SHEEM
2500    SEEM
2500    SEEM
2600    GAF
2600    GAF


[Figure: BCON Keyboard Layout Version 3; key legends not recoverable.]


Keyboard Layout and Keycodes to 
Reduced Codes Tables 




Display Codes 


Display code | Reduced code | Name (shape) (*)


0 to FF    0 to FF    Latin characters, identical to original ASCII set, with the
                      exception of the following two characters:
0F         Function code 0F
0F         Function code 0F


SHADDAH 

SUKUN 

FAT'HA

SHADDAH FAT'HA 
FAT'HATAN 
SHADDAH FAT'HATAN 
DAMMAH 

SHADDAH DAMMAH 
DAMMATAN 
SHADDAH DAMMATAN 
KASRAH 

SHADDAH KASRAH 
KASRATAN 


SHADDAH KASRATAN 
Arabic visible space 
Arabic visible boundary 


8E       Function code 8E
8F       Function code 8F
90       Function code 90
91       Function code 91
9E       Function code 9E
00(**)   (Reserved)
00       (Reserved)
00       (Reserved)
00       (Reserved)
00       (Reserved)
00       (Reserved)
00       (Reserved)
00       (Reserved)
         (Reserved)
         (Reserved)
         (Reserved)


(*): A means Alone, F means Final, I means Initial and M means Medial.

(**): 00 means that this display code is reserved and that no reduced code is
associated to it by default.


Arabic
Arabic
Arabic
Arabic
Arabic $ 
Arabic 
Arabic 
Arabic 
Arabic 
Arabic 
Arabic 
Arabic 
Arabic , (numeric comma) 
Arabic - 
Arabic . 


Arabic 


Arabic 
Arabic 
Arabic 
Arabic 
Arabic 
Arabic 
Arabic 
Arabic 
Arabic 
Arabic 
Arabic : 
Arabic ° 
Arabic 

Arabic 

Arabic ) 
Arabic 




Arabic @

HLANIZ AFI 

NRE! 

Veo OS ALT 
HAMZAH UNDER ALEF
PEH

PEH

TAA MARBUTA
TAA

TAA

THA'A

THA'A

JEEM

JEEM

SHEEM

SHEEM


— 


A 
| 
A 
A 
| 
A 
| 
A 
| 
A 


HA'A

HA'A

KHA'A

KHA'A

DAL

LAMALEF 

WASLA ON LAMALEF 
HAMZAH ON LAMALEF 
HAMZAH UNDER LAMALEF 
MADDAH ON LAMALEF 
MEEM 

Arabic |

Arabic ‘ 

Arabic } 

Arabic — 

Arabic a 




Arabic - 

MEEM 

NOON 

NOON 

HA 

Arabic text comma 

Arabic x (multiply sign) 
Arabic divide sign 
HAMZAH ON ALEF 
THAL 

Arabic > > 

Arabic << 

SEEN with compressed tail 
SHEEN with compressed tail 
SAD with compressed tail 
DAD with compressed tail 


Numeric space 
Numeric x (multiply sign)
Numeric %

Numeric divide sign
Numeric (

Numeric )

Numeric +

Numeric -

Numeric <

Numeric =

Numeric >

Arabic . 

Arabic | 

Arabic - 

Arabic — 

Arabic (DELETE sign) 







SHADDAH (linking)
SUKUN (linking)
FAT'HA (linking)
SHADDAH FAT'HA (linking)
FAT'HATAN (linking)
SHADDAH FAT'HATAN (linking)
DAMMAH (linking) 
SHADDAH DAMMAH (linking) 
DAMMATAN (linking) 
SHADDAH DAMMATAN (linking) 
KASRAH (linking) 
SHADDAH KASRAH (linking) 
KASRATAN (linking) 
SHADDAH KASRATAN (linking) 
TAIL 

KASHIDA 


ALEF 

WASLA ON ALEF 

SEEN with compressed tail 
HAMZAH ON ALEF 
HAMZAH UNDER ALEF 
MADDAH ON ALEF 
MADDAH ON ALEF 


BA'A

BA'A

BA'A

BA'A

PEH 

PEH 

TAA MARBUTA 
TAA 

TAA 







THA'A
THA'A
JEEM
JEEM
SHEEM
SHEEM
HA'A
HA'A
KHA'A
KHA'A
DAL
SHEEN with compressed tail
THAL
RA 

RA 
ZAIN 


ZAIN 
SEEM 
SEEM 
SEEN 
SEEN 
SEEN 
SEEN 
SEEN 
SHEEN 
SHEEN 
SHEEN 
SAD 
SAD 
DAD 
DAD 
TAH 







TAH 
DHAH
DHAH
AIN
AIN
AIN 
AIN 
GHAIN 
GHAIN 
GHAIN 
GHAIN 
FA 

FA 

FA 

FA 


QAF 


QAF
QAF 
QAF 
CAF 
CAF 
CAF 
CAF 
GAF 
GAF 
GAF 
GAF 
LAM 
LAM
LAM 
LAM 
LAMALEF 





WASLA ON LAMALEF
HAMZAH ON LAMALEF
HAMZAH UNDER LAMALEF
MADDAH ON LAMALEF
MEEM 

MEEM 

NOON 

NOON 

HA 

HA 

HA 

WAW 

WAW 

HAMZAH ON WAW 
HAMZAH ON WAW 
ALEF MAQSURA 


ALEF MAQSURA 

YA'A

YA'A

YA'A

YA'A

HAMZAH ON YA'A

HAMZAH ON YA'A

HAMZAH ON YA'A

HAMZAH ON YA'A

ALEF (for LAMALEF) MF

WASLA ON ALEF (for WASLA ON LAMALEF) M
HAMZAH ON ALEF (for HAMZAH ON LAMALEF) MF
HAMZAH UNDER ALEF (for HAMZAH UNDER LAMALEF) MF

MADDAH ON ALEF (for MADDAH ON LAMALEF) MF
SAD with compressed tail

DAD with compressed tail






APPENDIX E


CODAR I, II, U CODE SETS


Seven bit CODAR II


[Code table for 7-bit CODAR II; contents not recoverable.]


CODAR II coding compatible with CCITT Nr. 5. The set coded is the sub-system ASV-CODAR/1
comprising 64 characters for informatics and data transmission. It was presented at the
UNESCO/IBI Conference at Bizerte, 1976. The ASV-CODAR/2 sub-system can be obtained by
eliminating the characters framed in heavy lines.

Seven bit CODAR U


[Code table for 7-bit CODAR-U; contents not recoverable.]


APPENDIX F 


FINAL CODE
CODAR U-F.D.


Recommendation of the Final
Meeting Held in Rabat (Morocco)
on 22-24 April 1982


FOREWORD 


The importance of the role of the information channels in the Arabic world is becoming increasingly
obvious in all sectors. All Arabic countries are dealing with various types of information in the fields of
administration, organization, planning, science and technology.


The simple concept of cooperation between the Arab countries, and the positive results of
standardization, make it necessary to introduce a unified cipher for the Arabic characters used in the field
of information exchange.


In this connection the concerned Arabic organizations have taken considerable measures, such as the
two meetings which were held in Rabat (Morocco). The first meeting was the (Arab experts conference for
the unified Arabic cipher in the field of information). It was held with the cooperation of the (Arabic
Institute for Researches and Arabization) during the period 25th-29th Sept. 1980. The second
meeting was concerned with the regulation of the Arabic cipher in its final shape and was held on April 22-24,
1982. In this meeting the technical committee achieved the projected corrections, and the Arabic
cipher which is known as (CODAR U.F.D.) was ready.


Attached are the reasons for modification of the CODAR-U.F.D., the recommendations adopted at
the meetings and the final shape of the unified Arabic cipher, which will be formed into an Arabic standard.
This standard will be distributed to the ASMO member bodies for further study and approval as a
prelude to the actual experimentation and application.


RECOMMENDATION 


In the final session, and with a group agreement of the conferees on the final shape of the unified
Arabic cipher, the following recommendations have been adopted:


(1) The conference requests the Arab League Education Culture and Science Organization
(ALECSO) and Arab Organization for Standards and Metrology (ASMO) to adopt the
Arabic cipher which has been agreed upon, and take all necessary measures for its adoption
and enforcement in all Arabic countries.


(2) The conferees recommend that the information organizations that use the Arabic language
experiment with the new cipher before enforcement.


These recommendations shall be submitted in particular to the (Institute for Research and
Studies for Arabization) in Morocco, the Saudi Arabian Standards Organization and the
National Center for Information in Tunisia for the purpose of testing the new cipher before
the next (ASMO) meeting.


(3) It is recommended that the Arabic cipher in its new and final shape be adopted by the Arabic
association for telecommunications.


(4) It is also recommended that ALECSO, the ASMO and the Arab association for
telecommunication shall make the necessary coordination to use the Arabic language in the field of
information between them and other international organizations, bodies and the UNESCO.

(5) The meeting recommends an emergency session of ALECSO and ASMO to regulate the
specifications of the devices, the printing letters and their forms, and to find the best way of
utilizing computers.


(6) The meeting also recommends continuous contact between ALECSO and ASMO to see to
the best execution of these recommendations.


CODAR - U/FD


Unified Arabic coding, final form
(Rabat, 22-24 April 1982)


[Code table for CODAR-U/FD; contents not recoverable.]


ALECSO-ASMO meeting
on the finalization and standardization
of CODAR-U


ASMO'S APPROVED ARAB STANDARD SPECIFICATIONS 





ARAB STANDARD SPECIFICATIONS 
449 


Data processing - 7 - bit coded Arabic Character set for Information Interchange 


ARAB LEAGUE 
ARAB ORGANIZATION FOR STANDARDIZATION 
AND METROLOGY ( ASMO ) 




Preface 


This Arabic Standard was prepared by technical committee No. 8 (Arabic characters in informatics). 
Among the parties who participated in its preparation are the Arab League Educational, Cultural, 
and Scientific Organization (ALECSO), and the Institute of Studies and Research for Arabization in 
Morocco. 


In accordance with the 1982 Directives for the Technical Work of the Arab Organization for 
Standardization and Metrology - Part I: Procedure and Working Methods - this Arabic Standard was 
adopted by the resolution of the General Assembly of ASMO No: 

(R 342/G.A. 7/8 15- October 21, 1982 ). 




DATA PROCESSING: 7-BIT CODED ARABIC 
CHARACTER SET FOR INFORMATION 
INTERCHANGE 


0. INTRODUCTION 


This Arabic Standard specifies the properties of a coded character set using 7-bit binary codes for
information interchange among different types of data processing equipment using the Arabic
characters. It also specifies a set of control and graphic characters, in addition to its coded
representation inspired from ISO 646. The set of specific graphic characters in this standard
enables us under all circumstances to represent Arabic text whether it is totally vowelized,
partially vowelized, or unvowelized. This standard provides the possibilities for information
interchange for special applications, as well as the possibilities for expansion in case of
insufficiency of the coded character set. This Arabic Standard was made in accordance with ISO
646, and the following points were modified so that the standard ISO 646 is convenient for
Arabic usage:


- Table 1.
- Comments on this table.


Table 1 was modified in a way that permits the usage of the coded character set as a
separate group from the Latin character set described in ISO 646 for information interchange,
and the usage of basic programs in the Arabic language for the purpose of complete Arabization
when using computers. This table also allows the usage of the coded character set together with
the Latin character set as in the International Standard ISO 646 because of the correspondence
between these two standards.


Applying this standard requires several application standards to be implemented on a carrier
(magnetic carrier, transmission network, etc.), and these applications are specified in other
standards.


1. SCOPE AND FIELD OF APPLICATION 


1.1


This Arabic Standard contains a set of 128 characters (control characters and graphic 
characters such as letters, digits and symbols) with their coded representation. Most of 
these characters are mandatory and unchangeable, but provision is made for some 
flexibility to accommodate special national and other requirements. 


The need for graphics and controls in data processing and in data transmission has been taken 
into account in determining this character set. 


This Arabic Standard consists of a general table with a number of options, notes, a legend and 
explanatory notes. 


This character set is primarily intended for the interchange of information among data 
processing systems and associated equipment, and within message transmission systems. 


1.6 


This character set is applicable to all Arabic alphabets. 


This character set includes facilities for extension where its 128 characters are insufficient for 
particular applications. 


The definitions of some control characters in this Arabic Standard assume that data associated 
with them is to be processed serially in a forward direction. Their effect when included in strings 
of data which are processed other than serially in a forward direction or included in data 
formatted for fixed record processing may have undesirable effects or may require additional 
special treatment to ensure that the control characters have their desired effect. 


2. IMPLEMENTATION


2.1


This character set should be regarded as a basic alphabet in abstract sense. Its practical use 
requires definitions of its implementation in various media. For example, this could include 
punched tapes, punched cards, magnetic tapes and transmission channels, thus permitting 
interchange of data to take place either indirectly by means of an intermediate recording in a 
physical medium, or by local electrical connection of various units (such as input and output 
devices and computers) or by means of data transmission equipment. 


The implementation of this coded character set in physical media and for transmission, taking 
into account the need for error checking, is the subject of other ISO publications. 

NOTES ABOUT TABLE 1:


1) The format effectors are intended for equipment in which horizontal and vertical movements are
effected separately. If equipment requires the action of CARRIAGE RETURN to be combined
with a vertical movement, the format effector for that vertical movement may be used to effect the
combined movement. For example, if NEW LINE (symbol NL, equivalent to CR+LF) is
required, FE2 shall be used to represent it. This substitution requires agreement between the
sender and the recipient of the data.


The use of these combined functions may be restricted for international transmission on general
switched telecommunication networks (telegraph and telephone networks).


2) The symbols in locations 2/3 and 2/4 are used respectively to denote NUMBER SIGN and
CURRENCY SIGN. Note that the character does not designate the currency of a specific country
unless otherwise agreed upon between the sender and the recipient of data.


3) These positions are intended for national use or for alphabet extension. If not used for such
purposes, they may be used for representing symbols which do not have specific functions. This
requires agreement between the sender and the recipient of the data.


For the general case of information interchange among computers, these positions shall not be
used.


4) Positions and names of special signs which have specific functions in the code table are the same as in
ISO 646. However, such signs should be imaged and printed according to text as shown in the
following Table.
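
Locations such as 2/3 and 2/4 use the column/row notation of 7-bit code tables: the column is the value of the three high-order bits and the row is the value of the four low-order bits. A small Python sketch of the conversion follows; the function names are illustrative only.

```python
def position_to_code(position: str) -> int:
    """Convert a 'column/row' table position, e.g. '2/3', to its 7-bit code."""
    column, row = (int(part) for part in position.split("/"))
    if not (0 <= column <= 7 and 0 <= row <= 15):
        raise ValueError(f"outside a 7-bit code table: {position!r}")
    return column * 16 + row


def code_to_position(code: int) -> str:
    """Convert a 7-bit code back to 'column/row' notation."""
    if not 0 <= code <= 0x7F:
        raise ValueError(f"not a 7-bit code: {code!r}")
    return f"{code >> 4}/{code & 0x0F}"


if __name__ == "__main__":
    for pos in ("2/3", "2/4"):
        print(pos, hex(position_to_code(pos)))
```

Position 2/3 is therefore code 23 hexadecimal and position 2/4 is code 24 hexadecimal.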

APPENDIX H


PROGRAM CODE 
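
As a reading aid for the listing that follows, the translation scheme described in its header comment (break a line into tokens, give each token a type, then map reserved words, integers, special characters and user identifiers to their Latin forms) can be sketched in Python. This is not the thesis program: the reserved-word table is a hypothetical three-word sample, Unicode stands in for the Arabic code set, and numbering identifiers from 1 is an assumption.

```python
import re

# Hypothetical sample of the reserved-word table (Arabic -> English).
RES_WORDS = {
    "\u0628\u0631\u0646\u0627\u0645\u062C": "program",
    "\u0627\u0628\u062F\u0623": "begin",
    "\u0646\u0647\u0627\u064A\u0629": "end",
}
# Arabic semicolon, comma and question mark mapped to Latin equivalents.
SPECIAL = {"\u061B": ";", "\u060C": ",", "\u061F": "?"}
TOKEN_RE = re.compile(r"\w+|:=|[^\w\s]")


class Translator:
    def __init__(self) -> None:
        self.ids: dict[str, str] = {}  # plays the role of the identifier dictionary

    def map_identifier(self, name: str) -> str:
        """Assign id_NNNN in order of first appearance."""
        if name not in self.ids:
            self.ids[name] = f"id_{len(self.ids) + 1:04d}"
        return self.ids[name]

    def translate_token(self, token: str) -> str:
        if token in RES_WORDS:
            return RES_WORDS[token]
        if token.isdecimal():
            return str(int(token))  # Arabic digits become Latin digits
        if token in SPECIAL:
            return SPECIAL[token]
        if token.isascii():
            return token            # already Latin: pass through
        return self.map_identifier(token)

    def translate_line(self, line: str) -> str:
        return " ".join(self.translate_token(t) for t in TOKEN_RE.findall(line))


if __name__ == "__main__":
    t = Translator()
    print(t.translate_line("\u0627\u0628\u062F\u0623 \u0633 := \u0665 \u061B"))
    print(t.ids)
```

The `ids` mapping corresponds to the dictionary of user-defined identifiers that the program writes alongside the generated Pascal code.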


Gaeearanm Lexical Translator (input ,output) : 
(H HEHE LEMHKKARE KE RE SLE KK ARH EAE KER EERE ERE KR ARE MEEN EEE OX) 


{
  File Name     : Lexical.pas
  Module name   : Lexical Translator
  Author        : Sadek Saleh AL-Juhaiman
  Date created  : April 4, 1986
  Last change   : Aug 4, 1986

  Calls         :
    Open_File         = Gets the source file name and
                        initializes the output files.
    Initialize        = Initializes the hash table and global
                        variables.
    Fill_Buffer       = Fills the line buffer and increments the
                        line number.
    Buffer_Empty      = Checks if the line buffer was consumed.
    Token_And_Type    = Gets the next token and its type.
    Map_Iden_To_Latin = Searches for the identifier in the
                        symbol table. If not predefined,
                        inserts it.
    Latin_Integer     = Maps integer tokens to Latin integers.
    Special_Character = Maps special characters to the Latin
                        equivalent character.
    Control_Char      = Notifies the presence of escape
                        codes.

  Called by     : None
  Include files : Resource.pas

  Variables     :
    Line        = Input line buffer.
    Next_Loc    = Points at the first char of the next token.
    Token       = Buffer of 255 characters.
    Tok_Type    = Type of the token present in the token buffer.
    Tok_Len     = The length of the token in the token buffer.
    Line_No     = Source code line number.
    Debug_On    = Boolean variable, debugging feature, set
                  by an Arabic directive in the source code.
    Comment_On  = Directive, to include the comments in
                  the generated output.
    Res_Word    = Array of records for the reserved words;
                  contains the Arabic word and its English match.
    Match_Ind   = Index in Res_Word array to token location.
    Str10       = Integer string of size 10 characters.
    Next_Loc    = The first character of the next token in
                  the line buffer.
    Token       = Token buffer.
    Latin_Id    = The mapped identifier (in Latin form).
    Hash        = HashTable.
    ArabicSpell = Spelling string array of 3000 chars.
    Characters  = Number of chars in the spelling table.
    Line_No     = Counts the source lines read.
    Line_Size   = Line buffer upper limit.
    Match_Ind   = Index of the reserved word found in the
                  constant array.
    Iden_No     = The number of the identifier in the
                  sequence of arrival.
    Latin_Char  = One character buffer for special
                  characters.
    Lat_Int     = The integer translation to Latin.
    Error_Set   = Token error set.
}

{
  Comment :

  The program will ask for the input source file, with or
  without extension. If the name is valid it will open
  the file and initialize two output files. The two
  files will have the same file name and the extensions DIC
  and PAS. The DIC file has all user-defined identifiers
  with their assigned Id_Numbers. The PAS file will have
  the generated PASCAL code.

  After initialization the program will take one
  line and break it into tokens. Each token is given a
  type, then based on the type a translation module will
  be called.

  The above will continue for each line of code until
  a major error is encountered. A major error will result
  from long tokens when using comments or literal strings.
}
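The flow described in the header comment (read a line, cut it into tokens, give each token a type) can be illustrated with a small stand-alone sketch. It is not taken from Lexical.pas: the names `TKind`, `KindOf` and `KindName` are hypothetical, and the byte ranges are simplified assumptions rather than the thesis tables.

```pascal
program TokenKinds;
{ Sketch only: classifies runs of characters by their first character.
  The byte ranges are simplified assumptions, not the thesis tables. }
type
  TKind = (kBlank, kInteger, kIdentifier, kSpecial);
const
  KindName: array[TKind] of string[10] =
    ('blank', 'integer', 'identifier', 'special');

function KindOf(c: char): TKind;
begin
  case Ord(c) of
    $20, $A0: KindOf := kBlank;        { Latin or Arabic space }
    $B0..$B9: KindOf := kInteger;      { Arabic digits }
    $C0..$FF: KindOf := kIdentifier;   { assumed letter range }
  else
    KindOf := kSpecial;                { one-character token }
  end;
end;

var
  line: string;
  p, first: integer;
  kind: TKind;
begin
  { one letter, one space, two digits, one symbol }
  line := Chr($C7) + Chr($A0) + Chr($B1) + Chr($B2) + Chr($AB);
  p := 1;
  while p <= Length(line) do
  begin
    first := p;
    kind := KindOf(line[p]);
    repeat
      p := p + 1;
    until (p > Length(line)) or (KindOf(line[p]) <> kind)
          or (kind = kSpecial);
    WriteLn(KindName[kind], ' length ', p - first);
  end;
end.
```

Run on the five-byte sample line, the sketch reports an identifier of length 1, a blank of length 1, an integer of length 2 and a special character of length 1. The real translator then calls a different mapping routine for each type.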


(**************************************************************)

CONST
  Max_Arb_Word = ...;          { Size of Arabic word }
  Max_Lat_Word = ...;          { Size of Latin word }
  Max_Len      = ...;          { line & literal size }
  Res_Words    = ...;          { reserved words size }
  Maxkey       = 631;          { Prime number, hashing }
  MaxChar      = 3000;         { Size of spelling table }

TYPE
  Line_Range     = 0 .. Max_Len;
  Arab_Word_Str  = string[Max_Arb_Word];
                               { max char per Latin word }
  Latn_Word_Str  = string[Max_Lat_Word];
  Word_Rec       = RECORD      { constant array record }
                               { of reserved words }
                     English : Latn_Word_Str;
                     Arabic  : Arab_Word_Str;
                   END;
  Reserved_Index = 1 .. Res_Words;
  Words          = array [ Reserved_Index ] OF Word_Rec;
  Latin_Token    = string[6];  { string in the form id_000 }
  WordPointer    = ^WordRecord;  { Pointer to user defined id }
  WordRecord     = RECORD      { for user defined iden. }
                     Index,    { identifier number sequence }
                     Lenth,    { Length of the word }
                               { Location of the word's last }
                               { character in symbol table }
                     LastChar : integer;
                               { Link pointer to next word }
                     NextWord : WordPointer;
                               { assigned identifier number }
                     Latin_Id : Latin_Token;
                   END;
  HashTable      = array [1 .. Maxkey] OF WordPointer;
  SpellingTable  = array [1 .. MaxChar] OF char;
  Token_Str      = string[Max_Len];
  Line_Str       = string[Max_Len];
  Errors         = ( Long_Token, Long_Comment,
                     Long_Literal_Str, Illegal_Char );
  Types_Of_Token = ( Blanks, Illegal, Reserved_Word,
                     Literal_Str, Control_Cod, Unclsfd,
                     Identifier, Coment, Integer1,
                     Funct_Operator );
                               { Arabic characters range }
                               { from 80 Hex to FF Hex }
  Arbic_Alph     = set of $80 .. $FF;
  Str10          = string[10];

CONST
                               { resource file contains the ... }
res_word : words =
  (
   (english: 'absolute';  arabic: '...'),
   (english: 'and';       arabic: '...'),
   (english: 'array';     arabic: '...'),
   (english: 'begin';     arabic: '...'),
   (english: 'case';      arabic: '...'),
   (english: 'const';     arabic: '...'),
   (english: 'div';       arabic: '...'),
   (english: 'do';        arabic: '...'),
   (english: 'downto';    arabic: '...'),
   (english: 'else';      arabic: '...'),
   (english: 'end';       arabic: '...'),
   (english: 'external';  arabic: '...'),
   (english: 'text';      arabic: '...'),
   (english: 'forward';   arabic: '...'),
   (english: 'for';       arabic: '...'),
   (english: 'function';  arabic: '...'),
   (english: 'goto';      arabic: '...'),
   (english: 'concat';    arabic: '...'),
   (english: 'inline';    arabic: '...'),
   (english: 'if';        arabic: '...'),
   (english: 'in';        arabic: '...'),
   (english: 'label';     arabic: '...'),
   (english: 'mod';       arabic: '...'),
   (english: 'nil';       arabic: '...'),
   (english: 'not';       arabic: '...'),
   (english: 'overlay';   arabic: '...'),
   (english: 'of';        arabic: '...'),
   (english: 'or';        arabic: '...'),
   (english: 'packed';    arabic: '...'),
   (english: 'procedure'; arabic: '...'),
   (english: 'program';   arabic: '...'),
   (english: 'record';    arabic: '...'),
   (english: 'repeat';    arabic: '...'),
   (english: 'set';       arabic: '...'),
   (english: 'begin';     arabic: '...'),
   (english: 'shl';       arabic: '...'),
   (english: 'real';      arabic: '...'),
   (english: 'integer';   arabic: '...'),
   (english: 'boolean';   arabic: '...'),
   (english: 'read';      arabic: '...'),
   (english: 'readln';    arabic: '...'),
   (english: 'write';     arabic: '...'),
   (english: 'writeln';   arabic: '...'),
   (english: 'end';       arabic: '...'),
   (english: 'shr';       arabic: '...'),
   (english: 'string';    arabic: '...'),
   (english: 'then';      arabic: '...'),
   (english: 'type';      arabic: '...'),
   (english: 'to';        arabic: '...'),
   (english: '...';       arabic: '...'),
   (english: 'var';       arabic: '...'),
   (english: '...';       arabic: '...'),
   (english: '...';       arabic: '...'),
   (english: '...';       arabic: '...'),
   (english: 'while';     arabic: '...'),
   (english: 'input';     arabic: '...'),
   (english: 'output';    arabic: '...'),
   (english: 'with';      arabic: '...'),
   (english: 'xor';       arabic: '...')
  );

Arabic_Alph : Arbic_Alph =
  [ ... ];                     { Arabic ... }
                               { Arabic letters }
                               { underscore }
                               { tail generation }

Delimiters : SET OF char =     { Const set, delimiters }
  [ ...,                       { Space }
    ...,                       { BCON code function }
    ...,                       { BCON code function }
    ...,                       { BCON code function }
    ...,                       { BCON code function }
    ...,                       { Array left square bracket }
    ...,                       { Latin space }
    ...,                       { Array right square bracket }
    ...,                       { Arabic up arrow "pointer" }
    ...,                       { Arabic reverse apostrophe }
    ...,                       { Arabic space }
    ...,                       { Arabic multiply }
    ...,                       { Arabic period }
    ...,                       { Arabic divide }
    ...,                       { Arabic left parenthesis }
    ...,                       { Arabic right parenthesis }
    ...,                       { Arabic plus sign }
    ...,                       { Arabic comma }
    ...,                       { Arabic minus }
    ...,                       { numeric comma used as }
                               { the Latin decimal digit }
    ...,                       { Arabic colon }
    ...,                       { Arabic greater than }
    ...,                       { Arabic equal sign }
    ...                        { Arabic less than }
  ];


VAR
  Debug_On    : boolean;
  Comment_On  : boolean;
  Tok_Type    : Types_Of_Token;
  Tok_Len     : Line_Range;
  ...
  Line        : Line_Str;
  Next_Loc    : Line_Range;
  Token       : Token_Str;
  Latin_Id    : Latin_Token;
  Hash        : HashTable;
  ArabicSpell : SpellingTable;
  Characters  : integer;
  Line_No     : integer;
  Line_Size   : Line_Range;
  Iden_No     : integer;
  Match_Ind   : Reserved_Index;
  Latin_Char  : char;
  Lat_Int     : Str10;
  Error_Set   : SET OF Errors;
  OutFile     : text;
  InFile      : text;
  Dictionary  : text;

Procedure
OPEN_FILE;
VAR
  Valid     : boolean;       { for I/O error w/ file name }
  F_Name,                    { file name with no extension }
  File_Name : string[15];    { file name from keyboard }
  ind       : integer;
BEGIN
  Valid := false;
  WRITELN ('Input File name: ');
  REPEAT                     { until valid file name }
    READLN (File_Name);
    ASSIGN (InFile, File_Name);
    {$I-}                    { if no error opening file }
    RESET (InFile);          { then file exists }
    {$I+}                    { else I/O error, not Valid }
    Valid := ( IOresult = 0 );
    IF not (Valid) THEN
      BEGIN
        WRITELN ('=== FAILURE TO OPEN FILE === ',
                 File_Name );
        WRITELN (' Please RE_ENTER Input File name');
      END;
  UNTIL VALID;
  ind := 1;
  REPEAT                     { get the name w/o extension }
    F_Name(.ind.) := File_Name(.ind.);
    ind := ind + 1;
  UNTIL (File_Name(.ind.) = ' ') OR
        (File_Name(.ind.) = '.') OR
        ( ind = LENGTH (File_Name) );
  F_Name(.0.) := CHR (ind-1);
  ASSIGN (outfile, F_name+'.pas');      { translator output }
  ASSIGN (dictionary, F_Name+'.dic');   { dictionary }
  RESET   (infile);
  REWRITE (outfile);
  REWRITE (dictionary);      { file contains identifiers }
END;                         { and their translations }
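The naming rule used by OPEN_FILE (keep the part of the typed name before the first blank or period, then append the two extensions) can be tried on its own. The following is a minimal sketch, not the thesis routine: `BaseName` is a hypothetical helper, and it is written so that a name typed without an extension keeps all of its characters.

```pascal
program OutputNames;
{ Sketch only: BaseName is a hypothetical helper, not part of Lexical.pas. }

function BaseName(const fileName: string): string;
var
  i: integer;
begin
  i := 1;
  { stop at the first blank or period, or at the end of the name }
  while (i <= Length(fileName)) and (fileName[i] <> ' ')
        and (fileName[i] <> '.') do
    i := i + 1;
  BaseName := Copy(fileName, 1, i - 1);
end;

begin
  WriteLn(BaseName('sort.arb') + '.pas');   { sort.pas }
  WriteLn(BaseName('sort') + '.dic');       { sort.dic }
end.
```

The file name `sort.arb` is only an example; the listing does not say which extension the Arabic source files carry.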


Procedure
INITIALIZE;                  { Initialize the hash keys }
VAR                          { and the global variables }
  KeyNo : integer;
BEGIN
  Debug_On   := false;
  Comment_On := false;
  Error_Set  := [ ];
  Line_No    := 0;
  Iden_No    := 0;
  KeyNo      := 1;
  WHILE KeyNo <= Maxkey DO
    BEGIN
      hash(. KeyNo .) := nil;
      KeyNo := KeyNo + 1;
    END;
  characters := 0;           { no. of chars in spell tbl }
END;                         { Initialize }

PROCEDURE
FILL_BUFFER
  ( VAR line    : Line_Str;      { input line buffer }
    VAR where   : line_range;    { location in buffer }
    VAR line_no : integer;
    VAR Ln_Size : line_range
  );
BEGIN
  READLN (infile, line);
  Line_No := Line_No + 1;
  IF Debug_On THEN WRITELN (Line);
  IF (line = '{+...}') THEN      { set comment directive }
    BEGIN
      Comment_On := true;
      READLN (infile, line);
      line_No := Line_No + 1;
    END;
  IF line = '{-...}' THEN
    BEGIN                        { reset comment directive }
      Comment_On := false;
      READLN (infile, line);
      line_No := Line_No + 1;
    END;
  IF line = '{+...}' THEN        { set debug directive }
    BEGIN
      Debug_On := true;
      READLN (infile, line);
      line_No := Line_No + 1;
    END;
  IF line = '{-...}' THEN        { reset debug directive }
    BEGIN
      Debug_On := false;
      READLN (infile, line);
      line_No := Line_No + 1
    END;
  where   := 1;                  { Initialize line pointer }
  Ln_Size := length(line);       { line size }
END;



FUNCTION
BUFFER_EMPTY
  ( Next_Loc : line_range;
    Ln_Size  : line_range
  ): BOOLEAN;
BEGIN                        { check if buffer is empty }
  BUFFER_EMPTY := ( next_loc > Ln_Size );
END;

FUNCTION
EMPTY_ERROR_SET: BOOLEAN;
(**************************************************************
   If Error_Set is empty then no errors are found
   yet. Translation will continue.
 **************************************************************)
BEGIN
  EMPTY_ERROR_SET := (ERROR_SET = []);
END;


Procedure
TOKEN_AND_TYPE
  ( VAR where     : line_range;      { location of next token }
    VAR token     : Token_Str;
    VAR Tok_Len   : line_range;      { length of resulted token }
    VAR Tok_Type  : Types_Of_Token;  { Token type }
    VAR Match_Ind : reserved_index   { index of res. words }
  );
(**************************************************************
   module name  : TOKEN_AND_TYPE
   date created : April 7, 1986

   calls        : Blank, Comment, Literal_String,
                  Integer_Tok, Identifier_Tok,
                  Reserved_Tok, Special_Char,
                  Control_Char

   called by    : MAIN

   variables    :

   last change  : Aug 3, 1986

   Comment :
     The procedure collects the tokens and assigns
     Token Type names to them.
 **************************************************************)
VAR   index  : integer;      { For token indexing }
      ...    : char;         { special characters token }
CONST digits : SET OF char = [ #$B0 .. #$B9 ];



Procedure
BLANK;                       { collects blank(s) token }
VAR index : integer;
BEGIN
  Index := 0;
                             { A0 Arabic space 'blanks' }
                             { 20 Latin space 'blanks' }
  WHILE ( ORD(line[where]) = $A0 ) OR
        ( ORD(line[where]) = $20 ) DO
    BEGIN
      index := index + 1;
      token(.index.) := line(.where.);
      where := where + 1;
    END;
  Tok_Type := blanks;
  Tok_Len  := index;
  token(.0.) := CHR (index);
END;

Procedure
COMMENT;
(**************************************************************
   Procedure Comment will assign the matching Latin
   brackets and the body of the comment to the token.
   The token type is then set to Coment.
 **************************************************************)
BEGIN
  token[1] := '(';               { assign the opening bracket }
  token[2] := '*';               { and asterisk to token }
  index := 2;                    { start of comment body }
  where := where + 2;
  REPEAT                         { assign body of comment }
    index := index + 1;          { pointer of token buffer }
    token[index] := line[where];
    where := where + 1;          { pointer of line buffer }
  UNTIL ( (ORD (line[where])   = $AA ) AND
          (ORD (line[where+1]) = $A8 )) OR
        (where >= Line_Size);
  IF (where = Line_Size) THEN
    BEGIN                        { The end of line is reached }
      Tok_Type := Illegal;       { before closing the comment }
      Error_Set := Error_Set + [Long_Comment];
    END
  ELSE                           { the comment is valid }
    BEGIN
      token[index+1] := '*';     { assign the closing bracket }
      token[index+2] := ')';
      Tok_Type := coment;
      where := where + 2;        { advance line pointer }
      Tok_Len := index+2;        { advance token pointer }
      token[0] := chr(Tok_Len);  { set token length }
    END;
END;  { COMMENT }
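The bracket substitution performed by COMMENT can be seen in isolation in the sketch below. It is an illustration rather than the thesis code: `LatinComment` is a hypothetical function, and the byte codes $A9 $AA for the opening pair and $AA $A8 for the closing pair are assumed readings of a listing that is hard to read at that point.

```pascal
program CommentBrackets;
{ Sketch only. The byte codes below are assumed readings of the listing. }
const
  OpenBr = $A9; Star = $AA; CloseBr = $A8;

{ Copies one comment starting at line[where] and returns it with Latin
  brackets. ok is false when the line ends before the closing pair; the
  result should then be discarded. }
function LatinComment(const line: string; var where: integer;
                      var ok: boolean): string;
var
  body: string;
begin
  body := '';
  where := where + 2;                   { skip the opening pair }
  while (where < Length(line)) and
        not ((Ord(line[where]) = Star) and
             (Ord(line[where + 1]) = CloseBr)) do
  begin
    body := body + line[where];
    where := where + 1;
  end;
  ok := where < Length(line);           { closing pair was found }
  if ok then
    where := where + 2;                 { skip the closing pair }
  LatinComment := '(*' + body + '*)';
end;

var
  src, latin: string;
  p: integer;
  ok: boolean;
begin
  src := Chr(OpenBr) + Chr(Star) + ' note ' + Chr(Star) + Chr(CloseBr);
  p := 1;
  latin := LatinComment(src, p, ok);
  if ok then
    WriteLn(latin)                      { prints the Latin comment }
  else
    WriteLn('unterminated comment');
end.
```

For the sample line the program prints the comment body ` note ` between the Latin delimiters, and `p` ends one position past the closing pair, ready for the next token.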


Procedure
LITERAL_STRING;
(**************************************************************
   Literal string will look for single and double quotes,
   matching the quote character at the beginning and the
   end of the string, then assigning the Latin quotation
   marks.
 **************************************************************)
BEGIN
  index := 0;
  CASE ORD(line[where]) of
    $97 : REPEAT                     { single quotes }
            index := index + 1;
            token[index] := line[where];
            where := where + 1;
          UNTIL (ORD(line[where]) = $97 ) OR
                (where = line_Size);
    $A5 : REPEAT                     { double quotes }
            index := index + 1;
            token[index] := line[where];
            where := where + 1;
          UNTIL (ORD(line[where]) = $A5 ) OR
                (where = line_size);
  END; { CASE }
                                     { if literal ended with }
                                     { the right quote mark  }
  IF (ORD (line(.where.)) = $A5) OR
     (ORD (line(.where.)) = $97) THEN
    BEGIN
      index := index + 1;            { advance pointer for the }
      Tok_Len := index;              { quote mark. Set length. }
      Tok_Type := Literal_Str;
      token[0] := chr(Tok_Len);
                                     { for single quote literal }
      IF (ORD(token[1]) = $97) THEN
        BEGIN                        { assign single quotes }
          token[1]     := chr($27);
          token[index] := chr($27);
        END;
      IF (ORD(line[where]) = $A5 ) THEN
        BEGIN                        { assign double quotes }
          token[1]     := chr($22);
          token[index] := chr($22);
        END;
      where := where + 1;            { point to the next token }
    END
  ELSE                               { if line pointer did not see }
    BEGIN                            { single/double quote = error }
      Error_Set := Error_Set + [Long_Literal_Str];
      Tok_Type := illegal;
      Tok_Len := index;
      token[0] := chr(index);        { set length of token }
    END;
END;


PROCEDURE
INTEGER_TOK;
(**************************************************************
   The procedure will return the digits ranging
   from B0 .. B9 hex.
 **************************************************************)
BEGIN
  index := 0;
  WHILE ( line(.where.) in digits ) DO
    BEGIN
      index := index + 1;
      token(.index.) := line(.where.);
      where := where + 1
    END;
  Tok_Type := integer1;
  Tok_Len  := index;
  token[0] := chr(index);
END;


Procedure
IDENTIFIER_TOK;
(**************************************************************
   The procedure will look for any number of digits and
   underscore characters following the first letter.
 **************************************************************)
VAR valid : boolean;
BEGIN
  index := 0;
  REPEAT
    index := index + 1;
    token(.index.) := line(.where.);
    where := where + 1;
  UNTIL not ( ORD(line(.where.)) in Arabic_alph );
  Tok_Type := Identifier;
  Tok_Len  := index;
  token[0] := chr(index);
END;



Procedure
RESERVED_TOK
  ( VAR match_index : reserved_index );
(**************************************************************
   If the TOKEN is a reserved word, the procedure will
   set the token type to Reserved_Word and pass the
   index of the word in the constant array.
 **************************************************************)
VAR index : integer;
    hit   : boolean;               { when a match is found }
BEGIN
  hit := false;
  index := 1;
  WHILE (index <= res_words) AND ( not(hit) ) DO
    BEGIN
      IF ( token = res_word(.index.).Arabic ) THEN
        BEGIN                      { the token matches a }
          hit := true;             { reserved word }
          Match_index := index;
        END;
      index := index + 1;
    END;  { while no hit }
  IF hit THEN                      { if token is reserved word }
    Tok_Type := Reserved_word;     { set the token type }
END;
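The table search in RESERVED_TOK is a plain linear scan over the native-language column, returning the English spelling stored beside the match. A minimal stand-alone sketch follows. It assumes a three-entry table whose `Native` column holds Latin-letter placeholders (`ibda`, `intaha`, `itha`) where the thesis table holds Arabic words; `ReservedIndex` is a hypothetical name.

```pascal
program ReservedLookup;
{ Sketch only: the Native column holds Latin placeholders where the
  thesis table holds Arabic words. }
type
  WordRec = record
    English: string[10];
    Native: string[10];
  end;
const
  ResWords = 3;
  ResWord: array[1..ResWords] of WordRec = (
    (English: 'begin'; Native: 'ibda'),
    (English: 'end';   Native: 'intaha'),
    (English: 'if';    Native: 'itha'));

{ Returns the table index of token, or 0 when it is not reserved. }
function ReservedIndex(const token: string): integer;
var
  i, hit: integer;
begin
  hit := 0;
  i := 1;
  while (i <= ResWords) and (hit = 0) do
  begin
    if token = ResWord[i].Native then
      hit := i;
    i := i + 1;
  end;
  ReservedIndex := hit;
end;

var
  k: integer;
begin
  k := ReservedIndex('intaha');
  if k > 0 then
    WriteLn(ResWord[k].English)         { end }
  else
    WriteLn('identifier');
  WriteLn(ReservedIndex('counter'));    { 0 }
end.
```

A linear scan is adequate for a table of a few dozen words; user identifiers, which can number in the hundreds, go through the hash table instead.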


Procedure
SPECIAL_CHAR_TOK;
(**************************************************************
   The procedure gets all the tokens of one char
   other than the escape codes.
 **************************************************************)
VAR Illegal_Chars : ...;
BEGIN
  Illegal_Chars := [ $21 .. $7E,     { Latin chars }
                     ...,            { numeric characters, Arabic }
                     ...,            { Arabic @ character }
                     ...,            { non used character }
                     ...             { Arabic diacritics }
                   ];
  IF ord(line(.where.)) in Illegal_chars THEN
    BEGIN
      Tok_Type := illegal;
      error_set := error_set + [Illegal_Char];
    END
  ELSE
    BEGIN
      token[1] := line[where];       { one character special char }
      where := where + 1;            { advance line pointer }
      Tok_Len := 1;                  { set token length to one }
      token(.0.) := chr(1);
      Tok_Type := Funct_Operator;    { set token type }
    END;
END;

Procedure
CONTROL_CHARS;
(**************************************************************
   Control characters are used by BCON and will be omitted.
 **************************************************************)
BEGIN
  token[1] := line[where];
  Tok_Type := Control_Cod;
  Tok_Len  := 1;
  IF Debug_On THEN
    BEGIN
      WRITELN (' Control character (', ORD(line[where]),
               ') in source code');
      WRITELN (' In Line Number ', Line_No,
               ', Location = ', where );
    END;
  where := where + 1;
END;
BEGIN  (* TOKEN_AND_TYPE *)
  { Based on the first character of the token call an
    appropriate module to collect the token and set the type. }
  Tok_Type := unclsfd;                   { initialize token type }
  IF (ORD(line[where]) = $A9) AND        { $A9 opening bracket }
     (ORD(line[where+1]) = $AA) THEN     { $AA is asterisk }
    COMMENT;                             { call procedure Comment }
  IF Tok_Type <> coment THEN             { if not comment THEN based }
    CASE ORD(line[where]) OF             { on first char get the type }
      $A0, $20 : BLANK;                  { leading space(s) }
      $97, $A5 : LITERAL_STRING;
      $B0..$B9 : INTEGER_TOK;            { get integer token }
      ...      : BEGIN                   { leading letter }
                   IDENTIFIER_TOK;       { is it user defined/ }
                                         { reserved? }
                   RESERVED_TOK (match_ind);
                 END;
      ...      : CONTROL_CHARS;          { control characters }
      ELSE       SPECIAL_CHAR_TOK;
    END;  { case }
END;

Procedure
MAP_IDEN_TO_LATIN
  ( token        : Token_Str;
    lenth        : integer;
    VAR Latin_Id : Latin_Token );
(**************************************************************
   module name  : Map_Iden_To_Latin
   date created : April 20, 1986

   calls        : SEARCH

   called by    : MAIN

   variables    :
     token    = scanned identifier token.
     lenth    = length of scanned identifier.
     Latin_Id = the translated identifier in Latin form.

   last change  : Aug 2, 1986

   Comment :
     The procedure will look up an Arabic identifier; if not
     in the list it will insert the Arabic token in the list.
     The token will be assigned a Latin label for the use of
     the PASCAL compiler. The meaningless label will have the
     form Id_###, where each '#' is a digit.

   Note: code segments of this module are taken from
         "BRINCH HANSEN ON PASCAL COMPILERS" 1985
         (see thesis references).
 **************************************************************)

Function Hash_Key                    { return the hash key of }
  ( token : token_Str;               { the identifiers. }
    lenth : line_range
  ) : integer;
CONST W = 32641;                     { 32768 - 127, overflow check }
      N = Maxkey;                    { Prime number for words size }
VAR sum, i : integer;                { sum is the token ord. values }
BEGIN
  sum := 0;
  i := 1;
  WHILE i <= lenth DO
    BEGIN
      sum := (sum + ORD(token(.i.))) MOD W;
      i := i + 1;
    END;
  Hash_key := (sum MOD N) + 1;
END;
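The hash function is an additive one: the character codes are summed modulo W, and the sum is reduced modulo the prime table size to give a slot number from 1 to N. The sketch below shows the arithmetic on two short strings. The values W = 32641 and N = 631 are assumed here because the constants are hard to read in the listing, and `HashKey` is a hypothetical name.

```pascal
program HashKeyDemo;
{ Sketch only: W = 32641 and N = 631 are assumed values. A longint sum
  is used so the demo does not depend on 16-bit integer limits. }
const
  W = 32641;
  N = 631;

function HashKey(const token: string): integer;
var
  sum: longint;
  i: integer;
begin
  sum := 0;
  for i := 1 to Length(token) do
    sum := (sum + Ord(token[i])) mod W;
  HashKey := (sum mod N) + 1;           { always in 1..N }
end;

begin
  WriteLn(HashKey('ab'));               { 196 }
  WriteLn(HashKey('ba'));               { 196 }
end.
```

Both calls give 196, since 97 + 98 = 195 and 195 mod 631 + 1 = 196. An additive hash cannot tell anagrams apart, which is why every table hit is confirmed by comparing the stored spelling.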


Procedure INSERT
  ( token : token_Str;
    lenth : line_range;
    index : integer;
    keyNo : integer
  );
VAR m, n    : integer;
    pointer : wordpointer;
    temp    : Latin_token;

  PROCEDURE
  ID_NO ( VAR Latin_id : Latin_token );
  VAR
    temp : string[5];
  BEGIN
    CASE IDEN_NO OF
      0 .. 9   : BEGIN
                   STR (iden_no, TEMP);
                   Latin_id := CONCAT ('id_00', TEMP);
                 END;
      10 .. 99 : BEGIN
                   STR (iden_no, TEMP);
                   Latin_id := CONCAT ('id_0', TEMP)
                 END;
      ELSE       BEGIN
                   STR (iden_no, TEMP);
                   Latin_id := CONCAT ('id_', TEMP);
                 END;
    END;
  END;

BEGIN                                 { insert identifier in }
                                      { spelling table }
  characters := characters + lenth;
  m := lenth;
  n := characters - m;
  WHILE (m > 0) DO
    BEGIN
      ArabicSpell (. m+n .) := token(.m.);
      m := m - 1;
    END;
  ID_NO (temp);
  NEW (pointer);                      { Insert word record }
  pointer^.Latin_Id := temp;
  pointer^.NextWord := Hash(.keyNo.);
  pointer^.Index    := index;
  pointer^.lenth    := lenth;
  pointer^.lastchar := characters;
  WRITELN (dictionary, ' ',
           pointer^.Latin_Id, ' ', token);
  Hash (.keyNo.) := pointer;
END;
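The label generator pads the identifier's sequence number with leading zeros so that every generated name has the same width. A compact way to express the same padding is sketched below. It assumes the three-digit form `id_000` to `id_999`, and `IdLabel` is a hypothetical name; the listing reaches the same result with a CASE on the size of the number.

```pascal
program IdLabels;
{ Sketch only: assumes the three-digit label form id_000 .. id_999. }

function IdLabel(n: integer): string;
var
  digits: string;
begin
  Str(n, digits);                       { number to decimal text }
  while Length(digits) < 3 do
    digits := '0' + digits;             { pad on the left }
  IdLabel := 'id_' + digits;
end;

begin
  WriteLn(IdLabel(7));                  { id_007 }
  WriteLn(IdLabel(42));                 { id_042 }
  WriteLn(IdLabel(315));                { id_315 }
end.
```

Fixed-width labels keep the dictionary file aligned and make it easy to map a name in the generated code back to its original Arabic identifier.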
FUNCTION
FOUND
  ( token   : token_Str;
    lenth   : integer;
    pointer : WordPointer
  ) : boolean;
VAR same : boolean;
    m, n : integer;
BEGIN
  IF Pointer^.lenth <> Lenth THEN
    same := false
  ELSE
    BEGIN
      Same := true;
      m := lenth;
      n := pointer^.lastchar - m;
      WHILE same AND (m > 0) DO
        BEGIN
          same := token(.m.) = ArabicSpell (.m+n.);
          m := m - 1;
        END;
    END;
  FOUND := same;
END;



Procedure
Search
  ( token        : token_Str;
    lenth        : integer;          { token length }
    VAR Latin_Id : Latin_token       { returned Latin token }
  );
(**************************************************************
   Comment:

   The module will call function Hash_Key to get the
   token key and then look the key up in a hash table.

   The hash table contents are pointers, pointing at
   word records. Each record has the length of the token,
   its location in the symbol table, the Latin identifier
   number, and the next word in the linked list.

   If the pointer resulting from the key number is nil,
   the word is not in the table. The word must then be
   inserted if there is room in the spelling table.
   Insertion is made by procedure INSERT. If the
   pointer is pointing at a record, or a linked list of
   records, function FOUND is called to verify the spelling.
 **************************************************************)
VAR keyNo   : integer;               { global variables for SEARCH }
    done    : boolean;
    Pointer : wordpointer;
BEGIN  { SEARCH }
  keyNo := Hash_key (token, lenth);
  pointer := hash(.keyNo.);
  done := false;
  WHILE not(done) DO
    IF ( pointer = nil ) THEN        { insert new id. if size }
                                     { and number within limits }
      BEGIN                          { add identifier }
        Iden_No := Iden_No + 1;
        INSERT (token, lenth, Iden_No, keyNo);
        Latin_Id := hash(.keyNo.)^.Latin_Id;
        done := true;
      END
    ELSE IF FOUND (token, Tok_Len, pointer) THEN
      BEGIN
        Latin_id := pointer^.Latin_Id;
        done := true;
      END
    ELSE
      pointer := pointer^.nextword
END;

BEGIN  { MAP_IDEN_TO_LATIN }
  Search (Token, Tok_Len, Latin_Id);
END;   { Map_Iden_To_Latin }
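The three routines above (INSERT, FOUND and Search) together form a chained hash table whose spellings are packed end to end in one character array. The sketch below puts them together in reduced form so the behavior on a collision can be followed. It is an illustration under assumptions, not the thesis module: the sizes, the field name `Len` and the single-function layout are choices made for this example, and the sequence number is returned directly, without the Latin label.

```pascal
program SymbolTable;
{ Sketch only: sizes, names and layout are illustrative assumptions. }
const
  MaxKey = 631;
  MaxChar = 3000;
type
  WordPointer = ^WordRecord;
  WordRecord = record
    Index, Len, LastChar: integer;
    NextWord: WordPointer;
  end;
var
  Hash: array[1..MaxKey] of WordPointer;
  Spelling: array[1..MaxChar] of char;
  Characters, IdenNo, k: integer;

function HashKey(const token: string): integer;
var
  sum: longint;
  i: integer;
begin
  sum := 0;
  for i := 1 to Length(token) do
    sum := (sum + Ord(token[i])) mod 32641;
  HashKey := (sum mod MaxKey) + 1;
end;

{ True when the record's stored spelling equals token. }
function Found(const token: string; p: WordPointer): boolean;
var
  m, n: integer;
  same: boolean;
begin
  same := p^.Len = Length(token);
  m := Length(token);
  n := p^.LastChar - m;
  while same and (m > 0) do
  begin
    same := token[m] = Spelling[m + n];
    m := m - 1;
  end;
  Found := same;
end;

{ Returns the sequence number of token, inserting it on first sight. }
function Search(const token: string): integer;
var
  key, i: integer;
  p: WordPointer;
begin
  key := HashKey(token);
  p := Hash[key];
  while (p <> nil) and not Found(token, p) do
    p := p^.NextWord;
  if p = nil then
  begin
    for i := 1 to Length(token) do
      Spelling[Characters + i] := token[i];
    Characters := Characters + Length(token);
    IdenNo := IdenNo + 1;
    New(p);
    p^.Index := IdenNo;
    p^.Len := Length(token);
    p^.LastChar := Characters;
    p^.NextWord := Hash[key];           { chain in front of the slot }
    Hash[key] := p;
  end;
  Search := p^.Index;
end;

begin
  for k := 1 to MaxKey do
    Hash[k] := nil;
  Characters := 0;
  IdenNo := 0;
  WriteLn(Search('ab'));                { 1: first identifier }
  WriteLn(Search('ba'));                { 2: same slot, new spelling }
  WriteLn(Search('ab'));                { 1: found in the chain }
end.
```

The strings `ab` and `ba` hash to the same slot, so the second call walks the chain, fails the spelling check and inserts a new record; the third call finds the original record behind it. The sketch omits the capacity checks on `MaxChar` and on the identifier count that the main program performs before calling the mapper.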


PROCEDURE
GET_LATIN_SPEC_CHAR
  ( token          : Token_Str;
    VAR Latin_char : char
  );
VAR Arb_char : string[1];
BEGIN
  Arb_Char := token(.1.);
  CASE ORD (Arb_Char) OF
    $BC : Latin_char := '>';     { Arabic greater than }
    $BE : Latin_char := '<';     { Arabic less than }
    ... : Latin_char := ']';     { Arabic square bracket }
    ... : Latin_char := '[';
    $A8 : Latin_char := ')';     { Arabic RIGHT parenthesis }
    $A9 : Latin_char := '(';     { Arabic LEFT parenthesis }
    $AB : Latin_char := '+';     { Arabic plus }
    $AD : Latin_char := '-';     { Minus }
    $AF : Latin_char := '/';     { Divide }
    ... : Latin_char := '_';     { Under Score }
    $AA : Latin_char := '*';     { Multiply }
    $BA : Latin_char := ':';     { Colon }
    $BD : Latin_char := '=';     { Equal }
    $AE : Latin_char := '.';     { Numeric comma }
    ... : Latin_char := '^';     { Hat }
    ... : Latin_char := '.';     { Period }
    $BB : Latin_char := ';';     { Semicolon }
    $AC : Latin_char := ',';     { Comma }
  END;  { CASE }
END;
Procedure
LATIN_INT
  ( token       : token_Str;
    Tok_Len     : line_range;
    VAR Lat_Int : Str10 );
VAR ind : integer;
BEGIN                                { for each digit map to }
                                     { Latin digit }
  FOR ind := 1 to Tok_Len DO
    CASE ORD(token (.ind.)) OF
      $B0 : Lat_Int (.ind.) := '0';
      $B1 : Lat_Int (.ind.) := '1';
      $B2 : Lat_Int (.ind.) := '2';
      $B3 : Lat_Int (.ind.) := '3';
      $B4 : Lat_Int (.ind.) := '4';
      $B5 : Lat_Int (.ind.) := '5';
      $B6 : Lat_Int (.ind.) := '6';
      $B7 : Lat_Int (.ind.) := '7';
      $B8 : Lat_Int (.ind.) := '8';
      $B9 : Lat_Int (.ind.) := '9';
    END;
  Lat_Int(.0.) := token(.0.);        { set length of token }
END;
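Because the Arabic digits occupy ten consecutive codes, the digit table in LATIN_INT can also be expressed as a fixed offset. The sketch below assumes the digits sit at $B0 to $B9 in ascending order, as the listing indicates; `LatinInt` is a hypothetical name and the function is not part of the thesis code.

```pascal
program LatinDigits;
{ Sketch only: assumes the Arabic digits occupy $B0..$B9 in order. }

function LatinInt(const token: string): string;
var
  i: integer;
  r: string;
begin
  r := '';
  for i := 1 to Length(token) do
    r := r + Chr(Ord(token[i]) - $B0 + Ord('0'));
  LatinInt := r;
end;

begin
  WriteLn(LatinInt(Chr($B1) + Chr($B9) + Chr($B8) + Chr($B6)));   { 1986 }
end.
```

The explicit CASE in the listing has one advantage over the offset form: it would still work if the code page placed the digits in a non-contiguous order.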
PROCEDURE
PRINT_ERROR_MESSAGES;
var ind : integer;
BEGIN
  WRITELN (' *** ERROR ON LINE NO. ', line_no);
  for ind := 1 to line_size do write ( line(.ind.) );
  WRITELN;
  IF long_token IN error_set THEN
    WRITELN (' has long token ***', token);
  { IF long_comment IN error_set THEN
    WRITELN (' has long comment ****', token); }
  IF long_literal_Str in error_set THEN
    WRITELN (' *** UNCLOSED QUOTE ***');
  IF Illegal_Char IN error_set THEN
    WRITELN ('========== Character number ', ...,
             ' is out of range ==========');
END;

BEGIN
  OPEN_FILE;
  INITIALIZE;
  While not (eof (infile)) AND ( error_set = [] ) DO
    BEGIN                                { Line process }
      FILL_BUFFER (line, next_loc, line_no, line_size);
      WHILE not ( BUFFER_EMPTY (next_loc, line_size) ) AND
            ( error_set = [] ) DO
        BEGIN                            { Token process }
          TOKEN_AND_TYPE (next_loc, token, Tok_Len,
                          Tok_Type, Match_Ind);
          IF Debug_On THEN
            WRITELN ('token = ', token, ' length = ',
                     Tok_Len, ' Next_loc = ', Next_Loc);
          CASE Tok_Type of
            blanks         : FOR i := 1 to Tok_Len do
                               write (outfile, ' ');
            coment         : IF Comment_On THEN
                               write (outfile, token);
            literal_Str    : write (outfile, token);
            reserved_word  : write (OUTFILE,
                               res_word(.match_ind.).English);
            identifier     : IF (Iden_No <= 1000) AND
                                (characters <= MaxChar) THEN
                               BEGIN
                                 MAP_IDEN_TO_LATIN (token, Tok_Len,
                                                    Latin_Id);
                                 write (outfile, Latin_Id);
                               END;
            integer1       : BEGIN
                               LATIN_INT (token, Tok_Len, Lat_Int);
                               write (outfile, Lat_Int);
                             END;
            funct_operator : BEGIN
                               GET_LATIN_SPEC_CHAR (token, Latin_char);
                               write (outfile, latin_char);
                             END;
            Control_Cod    : WRITELN (' line no. ', line_no,
                                      ' Control code was ignored ');
            illegal        : BEGIN
                               PRINT_ERROR_MESSAGES;
                             END;
          END;  { CASE }
        END;  { WHILE }
      WRITELN (outFile);
    END;
  IF error_set <> [] THEN WRITELN ('error on token type');
  CLOSE ( outfile );
  CLOSE ( infile );
  CLOSE ( dictionary );
END.


APPENDIX I


TEST RUNS


Test Run 1

[Arabic source program listing; the text is not recoverable from this copy]

Source Code
